A CHINESE IMAGE CAPTIONING METHOD BASED ON FUSION ENCODER AND VISUAL KEYWORD SEARCH

Existing image captioning models pay too little attention to the local details of an image and tend to produce generic descriptions. To address this, a Chinese image captioning method that combines a fusion encoder with visual keyword search is proposed. A fusion encoder is constructed in which a convolutional neural network (CNN) extracts the local and global features of an image simultaneously, enriching the semantic information of the image features available in the long short-term memory (LSTM) decoding stage. To counter generic wording, a CNN-based image retrieval method is used to find potential visual words, which are integrated into the word vector generation process in the decoding stage. A reinforcement learning mechanism is introduced to optimize the CIDEr evaluation metric at the sentence level, improving the lexical diversity of the image descriptions. Experimental results verify the effectiveness of the proposed method.
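The sentence-level reinforcement learning step can be illustrated with a minimal sketch. The abstract does not state which policy-gradient variant is used, so the code below assumes plain REINFORCE with a baseline reward (for instance the score of a greedily decoded caption, as in self-critical training). The function names and the reward values are hypothetical stand-ins for CIDEr scores, not figures from the paper.

```python
import math


def sequence_log_prob(step_probs):
    """Sum the log-probabilities of the sampled tokens of one caption."""
    return sum(math.log(p) for p in step_probs)


def policy_gradient_loss(sampled_step_probs, sampled_reward, baseline_reward):
    """REINFORCE loss with a baseline for a single sampled caption.

    sampled_step_probs: probability the decoder assigned to each sampled token.
    sampled_reward:     sentence-level score of the sampled caption
                        (assumed to be a CIDEr score).
    baseline_reward:    score of a reference decode (assumed baseline).
    """
    advantage = sampled_reward - baseline_reward
    # Minimizing this loss raises the likelihood of captions that beat
    # the baseline and lowers it for captions that score below it.
    return -advantage * sequence_log_prob(sampled_step_probs)


# Hypothetical example: a three-token caption scoring above the baseline.
step_probs = [0.5, 0.25, 0.5]
loss = policy_gradient_loss(step_probs, sampled_reward=1.2, baseline_reward=0.9)
print(loss)
```

Because the reward is computed on the whole sentence, it does not need to be differentiable; only the token log-probabilities carry gradients in an actual training loop.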

Keywords

Attention mechanism; Chinese image caption; Encoder-decoder architecture; Image retrieval; Reinforcement learning

Citation

Meng Fancong, Xu Wei, Li Haibo, Wu Min, Zheng Junjie, Chen Xing. A CHINESE IMAGE CAPTIONING METHOD BASED ON FUSION ENCODER AND VISUAL KEYWORD SEARCH[J]. Computer Applications and Software, 2025, 42(4): 208

Paper Information

Received: Nov. 15, 2021

Accepted: Aug. 25, 2025

Published Online: Aug. 25, 2025

DOI:10.3969/j.issn.1000-386x.2025.04.030
