Towards Knowledge-Aware Video Captioning via Transitive Visual Relationship Detection

IEEE Transactions on Circuits and Systems for Video Technology (2022)

Abstract
Video captioning can be enhanced by incorporating knowledge, which is usually represented as relationships between objects. However, previous methods construct only superficial or static object relationships, and often introduce noise into the task through irrelevant common sense or fixed syntax templates. These problems undermine model interpretability and lead to undesirable results. To overcome these limitations, we propose to enhance video captioning with deep-level object relationships that are adaptively explored during training. Specifically, we present a Transitive Visual Relationship Detection (TVRD) module in which we estimate the actions of the visual objects and construct an Object-Action Graph (OAG) to describe the shallow relationships between objects and actions. We then bridge the gap between objects via their actions to transitively infer an Object-Object Graph (OOG), which reflects the deep-level relationships. We further feed the OOG to a graph convolutional network to refine the object representations with these deep-level relationships. With the refined representations, we employ an LSTM-based decoder for caption generation. Experimental results on two benchmark datasets, MSVD and MSR-VTT, demonstrate that the proposed method achieves state-of-the-art performance. Lastly, we present comprehensive ablation studies as well as visualizations of visual relationships to demonstrate the effectiveness and interpretability of our model.
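The transitive inference step described above (objects linked through shared actions) can be sketched as follows. This is a minimal illustration, not the paper's implementation: it assumes the OAG is given as an object-by-action affinity matrix, and the helper name `infer_oog` is hypothetical.

```python
import numpy as np

def infer_oog(oag: np.ndarray) -> np.ndarray:
    """Transitively infer an Object-Object Graph (OOG) from a bipartite
    Object-Action Graph (OAG).

    oag: (num_objects, num_actions) affinity scores in [0, 1].
    Returns a row-normalized (num_objects, num_objects) adjacency where
    two objects are connected if they attach to a shared action.
    """
    oog = oag @ oag.T                      # transitive step: object -> action -> object
    np.fill_diagonal(oog, 0.0)            # drop self-loops
    row_sum = oog.sum(axis=1, keepdims=True)
    # Normalize rows so the result can serve as a GCN adjacency; leave
    # all-zero rows (isolated objects) as zeros.
    return np.divide(oog, row_sum, out=np.zeros_like(oog), where=row_sum > 0)

# Toy example: 3 objects, 2 actions. Objects 0 and 1 share action 0,
# so they become connected in the inferred OOG.
oag = np.array([[0.9, 0.1],
                [0.8, 0.0],
                [0.0, 0.7]])
oog = infer_oog(oag)
```

The row-normalized OOG could then weight message passing in a graph convolutional layer to refine each object's representation, as the abstract outlines.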
Keywords
Video captioning, multi-modal learning, computer vision, natural language processing