Journal of Optoelectronics · Laser, Volume 35, Issue 9, 925 (2024)
Cross-modal image and text retrieval based on graph convolution and multi-head attention
To address the difficulty that existing cross-modal retrieval methods have in measuring the weight of the data at each node, and their limited ability to mine local consistency within modalities, a cross-modal image and text retrieval method based on graph convolution and a multi-head attention mechanism is proposed. First, each image or text sample serves as an independent node when constructing the modal graph, and graph convolution extracts the interaction information between samples to improve local consistency within each modality. Then, an attention mechanism is introduced into the graph convolution to adaptively learn the weight coefficient of each neighboring node, thereby distinguishing the influence of different neighbors on the central node. Finally, a multi-head attention layer with weight parameters is constructed to fully learn multiple sets of related features between nodes. Compared with eight existing methods, the proposed method improves mAP by 2.6% to 42.5% on the Wikipedia dataset and by 3.3% to 54.3% on the Pascal Sentence dataset.
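The per-neighbor attention weighting and multi-head aggregation described in the abstract can be illustrated with a minimal GAT-style sketch. This is not the paper's implementation; the shapes, the LeakyReLU slope of 0.2, and the random toy graph are all illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def graph_attention_layer(X, A, W, a, n_heads=4):
    """Sketch of a multi-head graph attention layer (GAT-style).

    X: (N, F) node features (each image/text sample is a node)
    A: (N, N) adjacency matrix with self-loops
    W: (n_heads, F, F') per-head projection matrices
    a: (n_heads, 2*F') per-head attention vectors
    Returns the concatenated head outputs, shape (N, n_heads*F').
    """
    outs = []
    for h in range(n_heads):
        H = X @ W[h]                        # (N, F') projected node features
        Fp = H.shape[1]
        # attention logit e_ij = LeakyReLU(a^T [H_i || H_j]),
        # split into source and destination contributions
        src = H @ a[h][:Fp]                 # (N,)
        dst = H @ a[h][Fp:]                 # (N,)
        e = src[:, None] + dst[None, :]     # (N, N) pairwise logits
        e = np.where(e > 0, e, 0.2 * e)     # LeakyReLU, slope 0.2 (assumed)
        e = np.where(A > 0, e, -1e9)        # mask non-neighbors
        # softmax over each node's neighborhood -> adaptive neighbor weights
        alpha = np.exp(e - e.max(axis=1, keepdims=True))
        alpha = alpha / alpha.sum(axis=1, keepdims=True)
        outs.append(alpha @ H)              # weighted neighbor aggregation
    # concatenating heads lets the layer learn multiple sets of
    # related features between nodes, as the abstract describes
    return np.concatenate(outs, axis=1)

# toy example: 4 nodes on a chain graph with self-loops
N, F, Fp, heads = 4, 8, 5, 4
A = np.eye(N) + np.diag(np.ones(N - 1), 1) + np.diag(np.ones(N - 1), -1)
X = rng.normal(size=(N, F))
W = rng.normal(size=(heads, F, Fp))
a = rng.normal(size=(heads, 2 * Fp))
out = graph_attention_layer(X, A, W, a, n_heads=heads)
print(out.shape)  # (4, 20)
```

Each head computes its own softmax-normalized neighbor weights, so different neighbors influence the central node to different degrees, and the concatenation across heads gives the layer multiple independent views of the neighborhood.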
HUA Chunjian, ZHANG Hongtu, JIANG Yi, YU Jianfeng, CHEN Ying. Cross-modal image and text retrieval based on graph convolution and multi-head attention[J]. Journal of Optoelectronics · Laser, 2024, 35(9): 925
Received: Feb. 3, 2023
Accepted: Dec. 20, 2024
Published Online: Dec. 20, 2024