Optics and Precision Engineering, Volume 30, Issue 24, 3198 (2022)
Chained semantic generation network for video captioning
[1] TANG P J, WANG H L. From video to language: survey of video captioning and description[J]. Acta Automatica Sinica, 48, 375-397(2022). (in Chinese)
[2] GU J X, WANG Z H, KUEN J et al. Recent advances in convolutional neural networks[J]. Pattern Recognition, 77, 354-377(2018).
[3] ELMAN J L. Finding structure in time[J]. Cognitive Science, 14, 179-211(1990).
[4] VENUGOPALAN S, ROHRBACH M, DONAHUE J et al. Sequence to sequence: video to text[C], 4534-4542(2015).
[5] HOCHREITER S, SCHMIDHUBER J. Long short-term memory[J]. Neural Computation, 9, 1735-1780(1997).
[6] CHEN K J, ZHANG Y. Recurrent neural network multi-label aerial images classification[J]. Opt. Precision Eng., 28, 1404-1413(2020). (in Chinese)
[7] HOCHREITER S, SCHMIDHUBER J. Long short-term memory[J]. Neural Computation, 9, 1735-1780(1997).
[8] YAN C G, TU Y B, WANG X Z et al. STAT: spatial-temporal attention mechanism for video captioning[J]. IEEE Transactions on Multimedia, 22, 229-241(2020).
[9] CHEN H R, LIN K, MAYE A et al. A semantics-assisted video captioning model trained with scheduled sampling[J]. Frontiers in Robotics and AI, 7, 475767(2020).
[10] ZHANG Z Q, SHI Y Y, YUAN C F et al. Object relational graph with teacher-recommended learning for video captioning[C], 13275-13285(2020).
[11] PEI W J, ZHANG J Y, WANG X R et al. Memory-attended recurrent network for video captioning[C], 8339-8348(2019).
[12] ZHAO H Y, ZHOU W, HOU X G et al. Multi-label classification of traditional national costume pattern image semantic understanding[J]. Opt. Precision Eng., 28, 695-703(2020). (in Chinese). doi: 10.3788/OPE.20202803.0695
[13] SHETTY R, LAAKSONEN J T. Video captioning with recurrent networks based on frame- and video-level features and visual content classification[J]. arXiv preprint(02949).
[14] TU Y B, ZHANG X S, LIU B T et al. Video description with spatial-temporal attention[C], 1014-1022(2017).
[15] GAN Z, GAN C, HE X D et al. Semantic compositional networks for visual captioning[C], 1141-1150(2017).
[16] YANG Q L, ZHOU B H, ZHENG W et al. Trajectory detection of small targets based on convolutional long short-term memory with attention mechanisms[J]. Opt. Precision Eng., 28, 2535-2548(2020). (in Chinese). doi: 10.37188/OPE.20202811.2535
[17] XU J, MEI T, YAO T et al. MSR-VTT: a large video description dataset for bridging video and language[C], 5288-5296(2016).
[18] GUADARRAMA S, KRISHNAMOORTHY N, MALKARNENKAR G et al. YouTube2Text: recognizing and describing arbitrary activities using semantic hierarchies and zero-shot recognition[C], 2712-2719(2013).
[19] ZOLFAGHARI M, SINGH K, BROX T. ECO: efficient convolutional network for online video understanding[C], 713-730(2018).
[20] XIE S N, GIRSHICK R, DOLLÁR P et al. Aggregated residual transformations for deep neural networks[C], 5987-5995(2017).
[21] PASUNURU R, BANSAL M. Multi-task video captioning with video and entailment generation[C], 1273-1283(2017).
[22] PASUNURU R, BANSAL M. Reinforced video captioning with entailment rewards[C], 979-985(2017).
[23] LIU S, REN Z, YUAN J S. SibNet: sibling convolutional encoder for video captioning[J]. IEEE Transactions on Pattern Analysis and Machine Intelligence, 43, 3259-3272(2021).
[24] WANG X, WANG Y F, WANG W Y. Watch, listen, and describe: globally and locally aligned cross-modal attentions for video captioning[C], 795-801(2018).
[25] WANG X, WU J W, ZHANG D et al. Learning to compose topic-aware mixture of experts for zero-shot video captioning[J]. Proceedings of the AAAI Conference on Artificial Intelligence, 33, 8965-8972(2019).
[26] WANG B R, MA L, ZHANG W et al. Controllable video captioning with POS sequence guidance based on gated fusion network[C], 2641-2650(2019).
[27] HOU J Y, WU X X, ZHAO W T et al. Joint syntax representation learning and visual cue translation for video captioning[C], 8917-8926(2019).
[28] AAFAQ N, AKHTAR N, LIU W et al. Spatio-temporal dynamics and semantic attribute enriched visual encoding for video captioning[C], 12479-12488(2019).
[29] PAN B X, CAI H Y, HUANG D A et al. Spatio-temporal graph for video captioning with knowledge distillation[C], 10867-10876(2020).
[30] ZHENG Q, WANG C Y, TAO D C. Syntax-aware action targeting for video captioning[C], 13093-13102(2020).
[31] PAN Y W, MEI T, YAO T et al. Jointly modeling embedding and translation to bridge video and language[C], 4594-4602(2016).
[32] YU H N, WANG J, HUANG Z H et al. Video paragraph captioning using hierarchical recurrent neural networks[C], 4584-4593(2016).
[33] GAO L L, GUO Z, ZHANG H W et al. Video captioning with attention-based LSTM and semantic consistency[J]. IEEE Transactions on Multimedia, 19, 2045-2055(2017).
Citation
Lin MAO, Hang GAO, Dawei YANG, Rubo ZHANG. Chained semantic generation network for video captioning[J]. Optics and Precision Engineering, 2022, 30(24): 3198
Category: Information Sciences
Received: May 16, 2022
Accepted: --
Published Online: Feb. 15, 2023
Author Email: Hang GAO (gao_hang2021@163.com)