Journal of Optoelectronics · Laser, Volume 35, Issue 9, 925 (2024)
Cross-modal image and text retrieval based on graph convolution and multi-head attention
To address the difficulty that existing cross-modal retrieval methods have in measuring the weight of the data at each node, and their limited ability to mine local consistency within modalities, a cross-modal image and text retrieval method based on graph convolution and a multi-head attention mechanism is proposed. First, each image or text sample serves as an independent node when constructing the modal graph, and graph convolution extracts the interaction information between samples to improve local consistency within each modality. Then, an attention mechanism is introduced into the graph convolution to adaptively learn the weight coefficient of each neighboring node, thereby distinguishing the influence of different neighbors on the central node. Finally, a multi-head attention layer with weight parameters is constructed to fully learn multiple sets of related features between nodes. Compared with eight existing methods, the proposed method improves mAP by 2.6% to 42.5% on the Wikipedia dataset and by 3.3% to 54.3% on the Pascal Sentence dataset.
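The per-neighbor attention weighting and multi-head aggregation described in the abstract can be illustrated with a minimal GAT-style sketch. This is not the paper's implementation; the shapes, the LeakyReLU slope of 0.2, and the random toy graph are all illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def graph_attention_layer(X, A, W, a, n_heads=4):
    """Sketch of a multi-head graph attention layer (GAT-style).

    X: (N, F) node features (each image/text sample is a node)
    A: (N, N) adjacency matrix with self-loops
    W: (n_heads, F, F') per-head projection matrices
    a: (n_heads, 2*F') per-head attention vectors
    Returns the concatenated head outputs, shape (N, n_heads*F').
    """
    outs = []
    for h in range(n_heads):
        H = X @ W[h]                        # (N, F') projected node features
        Fp = H.shape[1]
        # attention logit e_ij = LeakyReLU(a^T [H_i || H_j]),
        # split into source and destination contributions
        src = H @ a[h][:Fp]                 # (N,)
        dst = H @ a[h][Fp:]                 # (N,)
        e = src[:, None] + dst[None, :]     # (N, N) pairwise logits
        e = np.where(e > 0, e, 0.2 * e)     # LeakyReLU, slope 0.2 (assumed)
        e = np.where(A > 0, e, -1e9)        # mask non-neighbors
        # softmax over each node's neighborhood -> adaptive neighbor weights
        alpha = np.exp(e - e.max(axis=1, keepdims=True))
        alpha = alpha / alpha.sum(axis=1, keepdims=True)
        outs.append(alpha @ H)              # weighted neighbor aggregation
    # concatenating heads lets the layer learn multiple sets of
    # related features between nodes, as the abstract describes
    return np.concatenate(outs, axis=1)

# toy example: 4 nodes on a chain graph with self-loops
N, F, Fp, heads = 4, 8, 5, 4
A = np.eye(N) + np.diag(np.ones(N - 1), 1) + np.diag(np.ones(N - 1), -1)
X = rng.normal(size=(N, F))
W = rng.normal(size=(heads, F, Fp))
a = rng.normal(size=(heads, 2 * Fp))
out = graph_attention_layer(X, A, W, a, n_heads=heads)
print(out.shape)  # (4, 20)
```

Each head computes its own softmax-normalized neighbor weights, so different neighbors influence the central node to different degrees, and the concatenation across heads gives the layer multiple independent views of the neighborhood.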
HUA Chunjian, ZHANG Hongtu, JIANG Yi, YU Jianfeng, CHEN Ying. Cross-modal image and text retrieval based on graph convolution and multi-head attention[J]. Journal of Optoelectronics · Laser, 2024, 35(9): 925
Received: Feb. 3, 2023
Accepted: Dec. 20, 2024
Published Online: Dec. 20, 2024