Journal of Optoelectronics · Laser, Vol. 36, Issue 1, 27 (2025)
Research on image captioning generation method of double information flow based on ECA-Net
[1] BAI S, AN S. A survey on automatic image caption generation[J]. Neurocomputing, 2018, 311:291-304.
[2] FARHADI A, HEJRATI M, SADEGHI M A, et al. Every picture tells a story: Generating sentences from images[C]//European Conference on Computer Vision (ECCV), September 5-11, 2010, Heraklion, Crete, Greece. Berlin: Springer, 2010:15-29.
[3] GUPTA A, VERMA Y, JAWAHAR C. Choosing linguistics over vision to describe images[C]//2012 AAAI Conference on Artificial Intelligence, July 22-26, 2012, Toronto, Ontario, Canada. Menlo Park: AAAI, 2012, 26(1):606-612.
[4] LI S M, KULKARNI G, BERG T, et al. Composing simple image descriptions using web-scale N-grams[C]//15th Conference on Computational Natural Language Learning, June 23-24, 2011, Portland, Oregon, USA. Stroudsburg: ACL, 2011:220-228.
[5] KIROS R, SALAKHUTDINOV R, ZEMEL R. Multimodal neural language models[C]//31st International Conference on Machine Learning, June 21-26, 2014, Beijing, China. New York: ACM, 2014:595-603.
[6] XU K, BA J, KIROS R, et al. Show, attend and tell: Neural image caption generation with visual attention[C]//32nd International Conference on Machine Learning, July 6-11, 2015, Lille, France. New York: ACM, 2015:2048-2057.
[7] PAN X R, GE C J, LU R, et al. On the integration of self-attention and convolution[C]//2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition, June 18-24, 2022, New Orleans, LA, USA. New York: IEEE, 2022:815-825.
[8] MA Y W, JI J Y, SUN X S, et al. Towards local visual modeling for image captioning[J]. Pattern Recognition, 2023, 138:109420.
[9] GRAVES A. Generating sequences with recurrent neural networks[EB/OL]. (2013-08-04) [2023-06-18]. https://arxiv.org/abs/1308.0850.
[10] WU M R, ZHANG X Y, SUN X S, et al. DIFNet: Boosting visual information flow for image captioning[C]//2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition, June 18-24, 2022, New Orleans, LA, USA. New York: IEEE, 2022:18020-18029.
[11] VASWANI A, SHAZEER N, PARMAR N, et al. Attention is all you need[C]//Advances in Neural Information Processing Systems, December 4-9, 2017, Long Beach, California, USA. Red Hook: Curran Associates Inc., 2017, 30:6000-6010.
[12] WANG Q L, WU B G, ZHU P F, et al. ECA-Net: Efficient channel attention for deep convolutional neural networks[C]//2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition, June 13-19, 2020, Seattle, WA, USA. New York: IEEE, 2020:11534-11542.
[13] NAGRANI A, YANG S, ARNAB A, et al. Attention bottlenecks for multimodal fusion[C]//Advances in Neural Information Processing Systems, December 6-14, 2021, virtual. Red Hook: Curran Associates Inc., 2021, 34:14200-14213.
[14] LUO R. A better variant of self-critical sequence training[EB/OL]. (2020-03-22) [2023-06-18]. https://arxiv.org/abs/2003.09971.
[15] CHEN X, FANG H, LIN T Y, et al. Microsoft COCO captions: Data collection and evaluation server[EB/OL]. (2015-04-01) [2023-06-18]. https://arxiv.org/abs/1504.00325.
[16] KARPATHY A, JOHNSON J, FEI-FEI L, et al. Visualizing and understanding recurrent networks[EB/OL]. (2015-06-05) [2023-06-18]. https://arxiv.org/abs/1506.02078.
[17] WANG E K, ZHANG X, WANG F, et al. Multilayer dense attention model for image caption[J]. IEEE Access, 2019, 7:66358-66368.
[18] XIONG Y W, LIAO R J, ZHAO H S, et al. UPSNet: A unified panoptic segmentation network[C]//2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition, June 15-20, 2019, Long Beach, CA, USA. New York: IEEE, 2019:8818-8826.
[19] KINGMA D P, BA J. Adam: A method for stochastic optimization[EB/OL]. (2014-12-22) [2023-06-18]. https://arxiv.org/abs/1412.6980.
[20] FU H X, SONG G Q, WANG Y C. Improved YOLOv4 marine target detection combined with CBAM[J]. Symmetry, 2021, 13(4):623.
[21] MA X, GUO J D, SANSOM A, et al. Spatial pyramid attention for deep convolutional neural networks[J]. IEEE Transactions on Multimedia, 2021, 23:3048-3058.
[22] HU J, SHEN L, SUN G, et al. Squeeze-and-excitation networks[C]//2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, June 18-23, 2018, Salt Lake City, UT, USA. New York: IEEE, 2018:7132-7141.
[23] RENNIE S J, MARCHERET E, MROUEH Y, et al. Self-critical sequence training for image captioning[C]//2017 IEEE/CVF Conference on Computer Vision and Pattern Recognition, July 21-26, 2017, Honolulu, HI, USA. New York: IEEE, 2017:7008-7024.
[24] PAN Y W, LI Y H, YAO T, et al. Bottom-up and top-down object inference networks for image captioning[J]. ACM Transactions on Multimedia Computing, Communications, and Applications, 2023, 19(5):1-18.
[25] CORNIA M, STEFANINI M, BARALDI L, et al. Meshed-memory transformer for image captioning[C]//2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition, June 13-19, 2020, Seattle, WA, USA. New York: IEEE, 2020:10578-10587.
[26] PAN Y W, YAO T, LI Y H, et al. X-linear attention networks for image captioning[C]//2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition, June 13-19, 2020, Seattle, WA, USA. New York: IEEE, 2020:10971-10980.
[27] ZHANG X Y, SUN X S, LUO Y P, et al. RSTNet: Captioning with adaptive attention on visual and non-visual words[C]//2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition, June 20-25, 2021, Nashville, TN, USA. New York: IEEE, 2021:15465-15474.
LIU Zhongmin, SU Rong, HU Wenjin. Research on image captioning generation method of double information flow based on ECA-Net[J]. Journal of Optoelectronics · Laser, 2025, 36(1): 27
Received: Jun. 18, 2023
Accepted: Jan. 23, 2025
Published Online: Jan. 23, 2025
Author Email: LIU Zhongmin (liuzhmx@163.com)