Computer Applications and Software, Vol. 42, Issue 4, 208 (2025)

A CHINESE IMAGE CAPTIONING METHOD BASED ON FUSION ENCODER AND VISUAL KEYWORD SEARCH

Meng Fancong1, Xu Wei1, Li Haibo1, Wu Min1, Zheng Junjie1, and Chen Xing2
Author Affiliations
  • 1East China Yixing Pumped Storage Power Co., Ltd., Wuxi 214200, Jiangsu, China
  • 2College of Computer and Information, Hohai University, Nanjing 211100, Jiangsu, China

    Paper Information

    Received: Nov. 15, 2021

    Accepted: Aug. 25, 2025

    Published Online: Aug. 25, 2025

DOI: 10.3969/j.issn.1000-386x.2025.04.030
