Semantic Tag Augmented XlanV Model for Video Captioning

International Multimedia Conference (2021)

Abstract
The key to video captioning is leveraging cross-modal information from both the vision and the language perspectives. Rather than directly concatenating or attending over visual and linguistic features as in previous work, we propose to use semantic tags to bridge the gap between the two modalities. The semantic tags are the object tags and action tags detected in a video, which can be viewed as partial captions for the input video. To exploit the semantic tags effectively, we design a Semantic Tag augmented XlanV (ST-XlanV) model that encodes four kinds of visual and semantic features with X-Linear Attention based cross-attention modules. Moreover, tag-related tasks are designed into the pre-training stage to help the model exploit cross-modal information more fruitfully. With the help of the semantic tags, the proposed model reaches 5th place in the pre-training for video captioning challenge. Our code will be available at: https://github.com/RubickH/ST-XlanV.
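To illustrate the kind of X-Linear Attention the abstract refers to, the sketch below shows a simplified, single-head bilinear attention step in NumPy. It is not the authors' implementation: the projection weights, ReLU/sigmoid choices, and pooling are assumptions made for illustration, loosely following the general X-Linear idea of second-order (elementwise bilinear) query-key interactions combined with spatial and channel attention.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def x_linear_attention(q, K, V, rng):
    """Simplified single-head X-Linear attention (illustrative only).

    q: (d,) query vector; K, V: (n, d) keys/values over n regions or tags.
    Returns a (d,) attended feature. In a trained model the W_* weights
    are learned parameters; here they are random for demonstration.
    """
    n, d = K.shape
    Wq, Wk, Wv = (rng.standard_normal((d, d)) / np.sqrt(d) for _ in range(3))
    # Second-order (bilinear) query-key and query-value interactions:
    # elementwise products of projected query and projected keys/values.
    Bk = sigmoid(K @ Wk) * sigmoid(q @ Wq)               # (n, d)
    Bv = np.maximum(V @ Wv, 0) * np.maximum(q @ Wq, 0)   # (n, d)
    # Spatial attention: one weight per region, from the bilinear keys.
    ws = rng.standard_normal(d) / np.sqrt(d)
    beta_s = softmax(Bk @ ws)                            # (n,)
    # Channel attention: per-dimension gate from the pooled bilinear keys.
    Wc = rng.standard_normal((d, d)) / np.sqrt(d)
    beta_c = sigmoid(Bk.mean(axis=0) @ Wc)               # (d,)
    # Gated, spatially weighted sum of bilinear values.
    return beta_c * (beta_s @ Bv)                        # (d,)

rng = np.random.default_rng(0)
q = rng.standard_normal(8)
K = rng.standard_normal((4, 8))   # e.g. 4 detected object/action tag features
V = rng.standard_normal((4, 8))
attended = x_linear_attention(q, K, V, rng)
```

In ST-XlanV, modules of this form would let a query from one modality (e.g. the caption decoder state) attend over features of another (e.g. detected semantic tag embeddings), which is how the tags can act as a bridge between vision and language.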