Journal of Terahertz Science and Electronic Information Technology, Vol. 19, Issue 1, p. 156 (2021)

Visual Question Answering based on multimodal bidirectional guided attention

XIAN Rong*, HE Xiaohai, WU Xiaohong, and QING Linbo

    Existing deep co-attention models for the Visual Question Answering (VQA) task consider only unidirectional, question-guided attention over the image, which limits the interactivity of multimodal learning. To address this problem, a multimodal bidirectional guided attention network is proposed. The network consists of a multimodal feature extraction module, a bidirectional guided attention module, a feature fusion module, and a classifier. The extracted image and question features pass through stacked attention layers, each modality producing attention-weighted features, which are then linearly fused and fed to a softmax classifier to predict the answer. Finally, a counting module is incorporated to improve the model's counting ability. Experiments show that the model performs well on the public VQA v2.0 dataset, achieving overall accuracies of 70.77% on test-dev and 71.28% on test-std, an advantage over most state-of-the-art models.
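    The paper's implementation is not reproduced on this page; the following is a minimal PyTorch sketch of the idea the abstract describes: stacked bidirectional guided attention (question-guided image attention plus image-guided question attention), attention-weighted pooling of each modality, linear fusion, and a softmax classifier. All names, dimensions, and layer counts (BidirectionalGuidedAttention, VQAHead, dim=512, n_layers=6, n_answers=3129) are illustrative assumptions, not the authors' code, and the counting module is omitted.

```python
import torch
import torch.nn as nn

class BidirectionalGuidedAttention(nn.Module):
    """One bidirectional guided-attention layer (sketch): each modality
    attends to the other, so guidance flows in both directions."""
    def __init__(self, dim=512, heads=8):
        super().__init__()
        # Hypothetical hyper-parameters; the paper's settings may differ.
        self.img_from_qst = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.qst_from_img = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, img, qst):
        # Question-guided image attention: image regions attend to words.
        img_att, _ = self.img_from_qst(query=img, key=qst, value=qst)
        # Image-guided question attention: words attend to image regions.
        qst_att, _ = self.qst_from_img(query=qst, key=img, value=img)
        # Residual connections keep the original features in the stream.
        return img + img_att, qst + qst_att

class VQAHead(nn.Module):
    """Stacks attention layers, pools each modality with learned attention
    weights, linearly fuses the two vectors, and classifies the answer."""
    def __init__(self, dim=512, n_answers=3129, n_layers=6):
        super().__init__()
        self.layers = nn.ModuleList(
            [BidirectionalGuidedAttention(dim) for _ in range(n_layers)])
        self.img_pool = nn.Linear(dim, 1)   # attention weights over regions
        self.qst_pool = nn.Linear(dim, 1)   # attention weights over words
        self.fuse = nn.Linear(2 * dim, dim)
        self.classifier = nn.Linear(dim, n_answers)

    def forward(self, img, qst):
        for layer in self.layers:
            img, qst = layer(img, qst)
        # Attention-weighted pooling to one vector per modality.
        img_vec = (torch.softmax(self.img_pool(img), dim=1) * img).sum(1)
        qst_vec = (torch.softmax(self.qst_pool(qst), dim=1) * qst).sum(1)
        fused = self.fuse(torch.cat([img_vec, qst_vec], dim=-1))
        # Logits over candidate answers; softmax is applied in the loss.
        return self.classifier(fused)

# Usage with assumed shapes: 36 region features and 14 word features.
img = torch.randn(2, 36, 512)
qst = torch.randn(2, 14, 512)
logits = VQAHead()(img, qst)    # shape (2, 3129)
```

    Cross-entropy over the logits corresponds to the softmax classifier mentioned in the abstract; the residual connections and pooled linear fusion are common co-attention design choices, assumed here rather than taken from the paper.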

    Citation: XIAN Rong, HE Xiaohai, WU Xiaohong, QING Linbo. Visual Question Answering based on multimodal bidirectional guided attention[J]. Journal of Terahertz Science and Electronic Information Technology, 2021, 19(1): 156.

    Paper Information

    Received: Apr. 24, 2020

    Accepted: --

    Published Online: Apr. 21, 2021

    Corresponding Author Email: XIAN Rong (874293694@qq.com)

    DOI: 10.11805/tkyda2020172
