Exploring Object-Centered External Knowledge for Fine-Grained Video Paragraph Captioning

Guorui Yu, Yimin Hu, Yiqian Xu, Yuejie Zhang, Rui Feng, Tao Zhang, Shang Gao

ICASSP 2024 - 2024 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2024

Abstract
The video paragraph captioning task aims to generate a detailed, fluent, and relevant paragraph for a given video. Prior studies often focus on isolating visual objects (potential main components of a sentence) from the overall video content. They rarely explore the latent semantic relations between objects and high-level video concepts, resulting in dull or even incorrect descriptions. To create fine-grained and contextually relevant paragraph captions, we propose a novel framework that constructs a concept graph from a commonsense knowledge base and infers richer semantic meaning from the visual objects. Moreover, we employ a Vision-Guided Concept Selection Network that incorporates an under-sentence supervision mechanism to align the external knowledge with the visual information. Extensive experiments on ActivityNet Captions and YouCook2 demonstrate the effectiveness of our method compared to state-of-the-art approaches.
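The paper itself is not reproduced on this page, so the following is only a minimal, hypothetical sketch of the kind of vision-guided concept selection the abstract describes: candidate concepts retrieved from a commonsense knowledge graph for the detected objects are scored against the visual features, and only the best-aligned ones are kept for caption generation. The function name, feature shapes, and the cosine-similarity scoring are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def vision_guided_concept_selection(video_feat, concept_embs, top_k=5):
    """Score candidate commonsense concepts against a pooled video feature
    and keep the top-k best-aligned ones (illustrative assumption, not the
    paper's actual network).

    video_feat:   (d,) pooled visual feature for a video segment
    concept_embs: (n_concepts, d) embeddings of concepts retrieved from a
                  commonsense knowledge graph for the detected objects
    """
    # Cosine similarity between the video feature and each concept embedding.
    v = video_feat / (np.linalg.norm(video_feat) + 1e-8)
    c = concept_embs / (np.linalg.norm(concept_embs, axis=1, keepdims=True) + 1e-8)
    scores = c @ v
    # Keep only the concepts most consistent with the visual evidence.
    top_idx = np.argsort(-scores)[:top_k]
    return top_idx, scores[top_idx]

# Toy usage: 4 candidate concepts in an 8-dimensional embedding space.
rng = np.random.default_rng(0)
video_feat = rng.normal(size=8)
concept_embs = rng.normal(size=(4, 8))
idx, sc = vision_guided_concept_selection(video_feat, concept_embs, top_k=2)
print(idx, sc)
```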
Keywords
Video Paragraph Captioning, Common Sense, Sentence-Video Alignment, Object-centered External Knowledge