Computer Applications and Software, Vol. 42, Issue 4, 208 (2025)

A CHINESE IMAGE CAPTIONING METHOD BASED ON FUSION ENCODER AND VISUAL KEYWORD SEARCH

Meng Fancong1, Xu Wei1, Li Haibo1, Wu Min1, Zheng Junjie1, and Chen Xing2
Author Affiliations
  • 1East China Yixing Pumped Storage Power Co., Ltd., Wuxi 214200, Jiangsu, China
  • 2College of Computer and Information, Hohai University, Nanjing 211100, Jiangsu, China

    Paper Information

    Received: Nov. 15, 2021

    Accepted: Aug. 25, 2025

    Published Online: Aug. 25, 2025

DOI: 10.3969/j.issn.1000-386x.2025.04.030
