Optics and Precision Engineering, Volume 30, Issue 24, 3198 (2022)
Chained semantic generation network for video captioning
To address the weak semantic expression ability that leads to inaccurate text descriptions in video captioning, a chained semantic generation network (ChainS-Net) is proposed. A multistage, two-branch, crossing chained feature extraction structure is constructed that uses global and local domain modules as basic units and captures video semantics from global and local visual features, respectively. At each stage of the network, semantic information is transformed and parsed between the global and local domains. This design allows visual and semantic information to be cross-referenced and improves the semantic expression ability. Furthermore, multistage iterative processing yields a more effective semantic representation, thereby improving video captioning. Experimental results on the MSR-VTT and MSVD datasets show that the proposed ChainS-Net outperforms comparable algorithms. Compared with the semantics-assisted video captioning network SAVC, ChainS-Net achieves an average improvement of 2.5% across four video captioning metrics.
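As a rough illustration only (the abstract gives no implementation details), the PyTorch-style sketch below shows one plausible way the two-branch crossing chain could be organized: a global module and a local module at each stage, each consuming the other branch's semantic output from the previous stage. All module names, feature dimensions, and the fusion operation are assumptions, not the authors' code.

```python
# Hypothetical sketch of the chained two-branch idea described in the abstract.
# Names, dimensions, and the fusion step are illustrative assumptions.
import torch
import torch.nn as nn


class DomainModule(nn.Module):
    """Basic unit: fuses its branch's visual features with the semantic
    vector passed over from the other branch."""

    def __init__(self, feat_dim: int, sem_dim: int):
        super().__init__()
        self.fuse = nn.Linear(feat_dim + sem_dim, sem_dim)
        self.act = nn.ReLU()

    def forward(self, visual: torch.Tensor, cross_sem: torch.Tensor) -> torch.Tensor:
        return self.act(self.fuse(torch.cat([visual, cross_sem], dim=-1)))


class ChainSNetSketch(nn.Module):
    """Multistage chain: at every stage the global and local modules exchange
    the semantic vectors produced at the previous stage (the 'crossing')."""

    def __init__(self, feat_dim: int = 2048, sem_dim: int = 512, num_stages: int = 3):
        super().__init__()
        self.global_stages = nn.ModuleList(
            [DomainModule(feat_dim, sem_dim) for _ in range(num_stages)])
        self.local_stages = nn.ModuleList(
            [DomainModule(feat_dim, sem_dim) for _ in range(num_stages)])
        self.sem_dim = sem_dim

    def forward(self, global_feat: torch.Tensor, local_feat: torch.Tensor):
        batch = global_feat.size(0)
        g_sem = global_feat.new_zeros(batch, self.sem_dim)
        l_sem = local_feat.new_zeros(batch, self.sem_dim)
        for g_mod, l_mod in zip(self.global_stages, self.local_stages):
            # Cross exchange: each branch consumes the other branch's semantics.
            g_sem, l_sem = g_mod(global_feat, l_sem), l_mod(local_feat, g_sem)
        # The resulting semantic vectors would feed a caption decoder (not shown).
        return g_sem, l_sem


if __name__ == "__main__":
    net = ChainSNetSketch()
    g, l = net(torch.randn(2, 2048), torch.randn(2, 2048))
    print(g.shape, l.shape)  # torch.Size([2, 512]) torch.Size([2, 512])
```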
Lin MAO, Hang GAO, Dawei YANG, Rubo ZHANG. Chained semantic generation network for video captioning[J]. Optics and Precision Engineering, 2022, 30(24): 3198
Category: Information Sciences
Received: May 16, 2022
Accepted: --
Published Online: Feb. 15, 2023
Author Email: GAO Hang (gao_hang2021@163.com)