Optics and Precision Engineering, Volume 30, Issue 24, 3198 (2022)

Chained semantic generation network for video captioning

Lin MAO, Hang GAO*, Dawei YANG, and Rubo ZHANG
Author Affiliations
  • School of Electromechanical Engineering, Dalian Minzu University, Dalian 116600, China
    References (33)

    [1] TANG P J, WANG H L. From video to language: survey of video captioning and description[J]. Acta Automatica Sinica, 2022, 48(2): 375-397. (in Chinese)

    [2] GU J X, WANG Z H, KUEN J et al. Recent advances in convolutional neural networks[J]. Pattern Recognition, 77, 354-377(2018).

    [3] ELMAN J L. Finding structure in time[J]. Cognitive Science, 14, 179-211(1990).

    [4] VENUGOPALAN S, ROHRBACH M, DONAHUE J et al. Sequence to sequence: video to text[C], 4534-4542(2015).

    [5] HOCHREITER S, SCHMIDHUBER J. Long short-term memory[J]. Neural Computation, 9, 1735-1780(1997).

    [6] CHEN K J, ZHANG Y. Recurrent neural network multi-label aerial images classification[J]. Opt. Precision Eng., 2020, 28(6): 1404-1413. (in Chinese)

    [7] HOCHREITER S, SCHMIDHUBER J. Long short-term memory[J]. Neural Computation, 9, 1735-1780(1997).

    [8] YAN C G, TU Y B, WANG X Z et al. STAT: spatial-temporal attention mechanism for video captioning[J]. IEEE Transactions on Multimedia, 22, 229-241(2020).

    [9] CHEN H R, LIN K, MAYE A et al. A semantics-assisted video captioning model trained with scheduled sampling[J]. Frontiers in Robotics and AI, 7, 475767(2020).

    [10] ZHANG Z Q, SHI Y Y, YUAN C F et al. Object relational graph with teacher-recommended learning for video captioning[C], 13275-13285(2020).

    [11] PEI W J, ZHANG J Y, WANG X R et al. Memory-attended recurrent network for video captioning[C], 8339-8348(2019).

    [12] ZHAO H Y, ZHOU W, HOU X G, et al. Multi-label classification of traditional national costume pattern image semantic understanding[J]. Opt. Precision Eng., 2020, 28(3): 695-703. (in Chinese). doi: 10.3788/OPE.20202803.0695

    [13] SHETTY R, LAAKSONEN J T. Video captioning with recurrent networks based on frame- and video-level features and visual content classification[J]. arXiv preprint (02949).

    [14] TU Y B, ZHANG X S, LIU B T et al. Video description with spatial-temporal attention[C], 1014-1022(2017).

    [15] GAN Z, GAN C, HE X D et al. Semantic compositional networks for visual captioning[C], 1141-1150(2017).

    [16] YANG Q L, ZHOU B H, ZHENG W, et al. Trajectory detection of small targets based on convolutional long short-term memory with attention mechanisms[J]. Opt. Precision Eng., 2020, 28(11): 2535-2548. (in Chinese). doi: 10.37188/OPE.20202811.2535

    [17] XU J, MEI T, YAO T et al. MSR-VTT: a large video description dataset for bridging video and language[C], 5288-5296(2016).

    [18] GUADARRAMA S, KRISHNAMOORTHY N, MALKARNENKAR G et al. YouTube2Text: recognizing and describing arbitrary activities using semantic hierarchies and zero-shot recognition[C], 2712-2719(2013).

    [19] ZOLFAGHARI M, SINGH K, BROX T. ECO: efficient convolutional network for online video understanding[C], 713-730(2018).

    [20] XIE S N, GIRSHICK R, DOLLÁR P et al. Aggregated residual transformations for deep neural networks[C], 5987-5995(2017).

    [21] PASUNURU R, BANSAL M. Multi-task video captioning with video and entailment generation[C], 1273-1283(2017).

    [22] PASUNURU R, BANSAL M. Reinforced video captioning with entailment rewards[C], 979-985(2017).

    [23] LIU S, REN Z, YUAN J S. SibNet: sibling convolutional encoder for video captioning[J]. IEEE Transactions on Pattern Analysis and Machine Intelligence, 43, 3259-3272(2021).

    [24] WANG X, WANG Y F, WANG W Y. Watch, listen, and describe: globally and locally aligned cross-modal attentions for video captioning[C], 795-801(2018).

    [25] WANG X, WU J W, ZHANG D et al. Learning to compose topic-aware mixture of experts for zero-shot video captioning[J]. Proceedings of the AAAI Conference on Artificial Intelligence, 33, 8965-8972(2019).

    [26] WANG B R, MA L, ZHANG W et al. Controllable video captioning with POS sequence guidance based on gated fusion network[C], 2641-2650(2019).

    [27] HOU J Y, WU X X, ZHAO W T et al. Joint syntax representation learning and visual cue translation for video captioning[C], 8917-8926(2019).

    [28] AAFAQ N, AKHTAR N, LIU W et al. Spatio-temporal dynamics and semantic attribute enriched visual encoding for video captioning[C], 12479-12488(2019).

    [29] PAN B X, CAI H Y, HUANG D A et al. Spatio-temporal graph for video captioning with knowledge distillation[C], 10867-10876(2020).

    [30] ZHENG Q, WANG C Y, TAO D C. Syntax-aware action targeting for video captioning[C], 13093-13102(2020).

    [31] PAN Y W, MEI T, YAO T et al. Jointly modeling embedding and translation to bridge video and language[C], 4594-4602(2016).

    [32] YU H N, WANG J, HUANG Z H et al. Video paragraph captioning using hierarchical recurrent neural networks[C], 4584-4593(2016).

    [33] GAO L L, GUO Z, ZHANG H W et al. Video captioning with attention-based LSTM and semantic consistency[J]. IEEE Transactions on Multimedia, 19, 2045-2055(2017).

    Paper Information

    Category: Information Sciences

    Received: May 16, 2022

    Accepted: --

    Published Online: Feb. 15, 2023

    Author Email: Hang GAO (gao_hang2021@163.com)

    DOI: 10.37188/OPE.20223024.3198
