Bridging Video and Text: A Two-Step Polishing Transformer for Video Captioning

IEEE Transactions on Circuits and Systems for Video Technology (2022)

Abstract
Video captioning is a joint task of computer vision and natural language processing that aims to describe video content in natural language sentences. Most current methods cast this task as a mapping problem: they learn a mapping from visual features to natural language and generate captions directly from videos. However, the underlying challenge of video captioning, i.e., sequence-to-sequence mapping across different domains, is still not well handled. To address this problem, we introduce a polishing mechanism that mimics the human polishing process and propose a generate-and-polish framework for video captioning. In this paper, we propose a two-step transformer-based polishing network (TSTPN) consisting of two sub-modules: a generation module that produces a caption candidate, and a polishing module that gradually refines the generated candidate. Specifically, the candidate provides global information about the visual contents in a semantically meaningful order: first, it serves as a semantic intermediary to bridge the semantic gap between text and video, with a cross-modal attention mechanism for better cross-modal modeling; second, it provides a global planning ability that maintains the semantic consistency and fluency of the whole sentence for better sequence mapping. In experiments, we present extensive evaluations showing that the proposed TSTPN achieves performance comparable to, and in some cases better than, state-of-the-art methods on benchmark datasets.
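As a rough illustration of the two-step idea, the sketch below (in PyTorch) drafts a caption with a causal transformer decoder, then re-grounds the draft in the video features through cross-modal attention. The module names, layer sizes, and the single refinement round are our own assumptions for exposition, not the authors' released implementation.

```python
# Minimal sketch of a generate-and-polish captioning pipeline.
# All names and hyperparameters here are illustrative assumptions.
import torch
import torch.nn as nn


class GenerationModule(nn.Module):
    """Step 1: draft a caption candidate from video features."""

    def __init__(self, vocab_size, d_model=512, nhead=8, num_layers=4):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        layer = nn.TransformerDecoderLayer(d_model, nhead, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, num_layers)
        self.out = nn.Linear(d_model, vocab_size)

    def forward(self, tokens, video_feats):
        # tokens: (B, T) caption prefix; video_feats: (B, N, d_model)
        T = tokens.size(1)
        causal = torch.triu(torch.full((T, T), float("-inf")), diagonal=1)
        h = self.decoder(self.embed(tokens), video_feats, tgt_mask=causal)
        return self.out(h)  # (B, T, vocab_size)


class PolishingModule(nn.Module):
    """Step 2: refine the draft. The candidate is re-encoded with
    bidirectional self-attention for global sentence context, then
    attends to the video via cross-modal attention."""

    def __init__(self, vocab_size, d_model=512, nhead=8, num_layers=2):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        enc_layer = nn.TransformerEncoderLayer(d_model, nhead, batch_first=True)
        self.encoder = nn.TransformerEncoder(enc_layer, num_layers)
        self.cross_attn = nn.MultiheadAttention(d_model, nhead, batch_first=True)
        self.out = nn.Linear(d_model, vocab_size)

    def forward(self, draft_tokens, video_feats):
        h = self.encoder(self.embed(draft_tokens))            # global context
        h, _ = self.cross_attn(h, video_feats, video_feats)   # re-ground in video
        return self.out(h)  # refined logits, (B, T, vocab_size)


# One polishing round over a greedy draft; iterating this step would
# give the gradual refinement the abstract describes.
vocab_size, B, N, T = 10000, 2, 20, 12
gen, pol = GenerationModule(vocab_size), PolishingModule(vocab_size)
video = torch.randn(B, N, 512)
draft = gen(torch.randint(0, vocab_size, (B, T)), video).argmax(-1)
refined = pol(draft, video).argmax(-1)  # (B, T) polished token ids
```

Note that the polishing step sees the entire draft at once (bidirectional self-attention), which is one plausible way to realize the global planning ability over the whole sentence that the abstract attributes to the candidate.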
Keywords
Video captioning, transformer, polishing mechanism, cross-modal modeling