Journal of Terahertz Science and Electronic Information Technology, Volume 19, Issue 1, 156 (2021)
Visual Question Answering based on multimodal bidirectional guided attention
Existing deep co-attention models for Visual Question Answering (VQA) consider only unidirectional, question-guided attention over the image, which limits the interactivity of multimodal learning. To address this, a multimodal bidirectional guided attention network is proposed. The network consists of a multimodal feature extraction module, a bidirectional guided attention module, a feature fusion module, and a classifier. The extracted image and question features each pass through stacked attention layers to produce attention-weighted features, which are then linearly fused and fed into a softmax classifier to predict the answer. Finally, a counting module is incorporated to improve the model's counting ability. Experiments show that the model performs well on the public VQA v2.0 dataset, achieving overall accuracies of 70.77% on test-dev and 71.28% on test-std, which is competitive with most state-of-the-art models.
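The core idea of bidirectional guided attention can be illustrated with a minimal sketch: the question guides attention over image regions, and symmetrically the image guides attention over question words, before the two attended features are fused. The sketch below is a simplified NumPy illustration, not the paper's implementation; the mean-pooling of the guiding modality, the feature dimensions, the concatenation-based fusion, and all function names are assumptions for illustration only.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax over the given axis.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def guided_attention(guide, target):
    """Weight `target` features by their affinity with a pooled `guide` vector.

    guide:  (d,)   pooled feature of the guiding modality
    target: (n, d) region/word features of the guided modality
    Returns an attention-weighted (d,) summary of `target`.
    """
    scores = target @ guide / np.sqrt(target.shape[1])  # (n,) scaled dot products
    weights = softmax(scores)                           # (n,) attention weights
    return weights @ target                             # (d,) weighted sum

def bidirectional_guided_attention(img_feats, q_feats):
    """Question-guided image attention AND image-guided question attention."""
    q_pool = q_feats.mean(axis=0)    # crude pooling stand-in (assumption)
    v_pool = img_feats.mean(axis=0)
    v_att = guided_attention(q_pool, img_feats)  # question -> image direction
    q_att = guided_attention(v_pool, q_feats)    # image -> question direction
    return v_att, q_att

rng = np.random.default_rng(0)
img = rng.normal(size=(36, 512))   # e.g. 36 detected image regions
que = rng.normal(size=(14, 512))   # e.g. 14 question tokens
v_att, q_att = bidirectional_guided_attention(img, que)
fused = np.concatenate([v_att, q_att])  # simple fusion stand-in before the classifier
print(v_att.shape, q_att.shape, fused.shape)
```

In a real model the pooled guide vectors and the attention scores would come from learned projections, and the fused vector would feed the softmax answer classifier; the sketch only shows the two symmetric attention directions that distinguish bidirectional from question-only guidance.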
XIAN Rong, HE Xiaohai, WU Xiaohong, QING Linbo. Visual Question Answering based on multimodal bidirectional guided attention[J]. Journal of Terahertz Science and Electronic Information Technology, 2021, 19(1): 156
Received: Apr. 24, 2020
Accepted: --
Published Online: Apr. 21, 2021
The Author Email: Rong XIAN (874293694@qq.com)