Optics and Precision Engineering, Volume 30, Issue 24, 3198 (2022)
Chained semantic generation network for video captioning
[1] TANG P J, WANG H L. From video to language: survey of video captioning and description[J]. Acta Automatica Sinica, 48, 375-397(2022). (in Chinese)
[2] GU J X, WANG Z H, KUEN J et al. Recent advances in convolutional neural networks[J]. Pattern Recognition, 77, 354-377(2018).
[3] ELMAN J L. Finding structure in time[J]. Cognitive Science, 14, 179-211(1990).
[4] VENUGOPALAN S, ROHRBACH M, DONAHUE J et al. Sequence to sequence: video to text[C], 4534-4542(2015).
[5] HOCHREITER S, SCHMIDHUBER J. Long short-term memory[J]. Neural Computation, 9, 1735-1780(1997).
[6] CHEN K J, ZHANG Y. Recurrent neural network multi-label aerial images classification[J]. Opt. Precision Eng., 28, 1404-1413(2020). (in Chinese)
[7] HOCHREITER S, SCHMIDHUBER J. Long short-term memory[J]. Neural Computation, 9, 1735-1780(1997).
[8] YAN C G, TU Y B, WANG X Z et al. STAT: spatial-temporal attention mechanism for video captioning[J]. IEEE Transactions on Multimedia, 22, 229-241(2020).
[9] CHEN H R, LIN K, MAYE A et al. A semantics-assisted video captioning model trained with scheduled sampling[J]. Frontiers in Robotics and AI, 7, 475767(2020).
[10] ZHANG Z Q, SHI Y Y, YUAN C F et al. Object relational graph with teacher-recommended learning for video captioning[C], 13275-13285(2020).
[11] PEI W J, ZHANG J Y, WANG X R et al. Memory-attended recurrent network for video captioning[C], 8339-8348(2019).
[12] ZHAO H Y, ZHOU W, HOU X G et al. Multi-label classification of traditional national costume pattern image semantic understanding[J]. Opt. Precision Eng., 28, 695-703(2020). (in Chinese). doi: 10.3788/OPE.20202803.0695
[13] SHETTY R, LAAKSONEN J T. Video captioning with recurrent networks based on frame- and video-level features and visual content classification[J]. arXiv preprint(02949).
[14] TU Y B, ZHANG X S, LIU B T et al. Video description with spatial-temporal attention[C], 1014-1022(2017).
[15] GAN Z, GAN C, HE X D et al. Semantic compositional networks for visual captioning[C], 1141-1150(2017).
[16] YANG Q L, ZHOU B H, ZHENG W et al. Trajectory detection of small targets based on convolutional long short-term memory with attention mechanisms[J]. Opt. Precision Eng., 28, 2535-2548(2020). (in Chinese). doi: 10.37188/OPE.20202811.2535
[17] XU J, MEI T, YAO T et al. MSR-VTT: a large video description dataset for bridging video and language[C], 5288-5296(2016).
[18] GUADARRAMA S, KRISHNAMOORTHY N, MALKARNENKAR G et al. YouTube2Text: recognizing and describing arbitrary activities using semantic hierarchies and zero-shot recognition[C], 2712-2719(2013).
[19] ZOLFAGHARI M, SINGH K, BROX T. ECO: efficient convolutional network for online video understanding[C], 713-730(2018).
[20] XIE S N, GIRSHICK R, DOLLÁR P et al. Aggregated residual transformations for deep neural networks[C], 5987-5995(2017).
[21] PASUNURU R, BANSAL M. Multi-task video captioning with video and entailment generation[C], 1273-1283(2017).
[22] PASUNURU R, BANSAL M. Reinforced video captioning with entailment rewards[C], 979-985(2017).
[23] LIU S, REN Z, YUAN J S. SibNet: sibling convolutional encoder for video captioning[J]. IEEE Transactions on Pattern Analysis and Machine Intelligence, 43, 3259-3272(2021).
[24] WANG X, WANG Y F, WANG W Y. Watch, listen, and describe: globally and locally aligned cross-modal attentions for video captioning[C], 795-801(2018).
[25] WANG X, WU J W, ZHANG D et al. Learning to compose topic-aware mixture of experts for zero-shot video captioning[J]. Proceedings of the AAAI Conference on Artificial Intelligence, 33, 8965-8972(2019).
[26] WANG B R, MA L, ZHANG W et al. Controllable video captioning with POS sequence guidance based on gated fusion network[C], 2641-2650(2019).
[27] HOU J Y, WU X X, ZHAO W T et al. Joint syntax representation learning and visual cue translation for video captioning[C], 8917-8926(2019).
[28] AAFAQ N, AKHTAR N, LIU W et al. Spatio-temporal dynamics and semantic attribute enriched visual encoding for video captioning[C], 12479-12488(2019).
[29] PAN B X, CAI H Y, HUANG D A et al. Spatio-temporal graph for video captioning with knowledge distillation[C], 10867-10876(2020).
[30] ZHENG Q, WANG C Y, TAO D C. Syntax-aware action targeting for video captioning[C], 13093-13102(2020).
[31] PAN Y W, MEI T, YAO T et al. Jointly modeling embedding and translation to bridge video and language[C], 4594-4602(2016).
[32] YU H N, WANG J, HUANG Z H et al. Video paragraph captioning using hierarchical recurrent neural networks[C], 4584-4593(2016).
[33] GAO L L, GUO Z, ZHANG H W et al. Video captioning with attention-based LSTM and semantic consistency[J]. IEEE Transactions on Multimedia, 19, 2045-2055(2017).
Citation
Lin MAO, Hang GAO, Dawei YANG, Rubo ZHANG. Chained semantic generation network for video captioning[J]. Optics and Precision Engineering, 2022, 30(24): 3198
Category: Information Sciences
Received: May 16, 2022
Accepted: --
Published Online: Feb. 15, 2023
Author Email: Hang GAO (gao_hang2021@163.com)