Video Question Answering via Gradually Refined Attention over Appearance and Motion.

MM '17: ACM Multimedia Conference, Mountain View, California, USA, October 2017

Abstract
Image question answering (ImageQA) has recently gained considerable attention in the research community. Its natural extension, video question answering (VideoQA), is less explored. Although the two tasks look similar, VideoQA is more challenging, mainly because of the complexity and diversity of videos, so simply extending ImageQA methods to videos is insufficient and suboptimal. In particular, working with video requires modeling its inherent temporal structure and analyzing the diverse information it contains. In this paper, we exploit the appearance and motion information in the video with a novel attention mechanism. More specifically, we propose an end-to-end model that gradually refines its attention over the appearance and motion features of the video, using the question as guidance. The question is processed word by word until the model generates the final optimized attention. The weighted representation of the video, together with other contextual information, is used to generate the answer. Extensive experiments show the advantages of our model over baseline models. We also demonstrate the effectiveness of our model by analyzing the refined attention weights during the question answering procedure.
Keywords
Video Question Answering, Attention Mechanism, Neural Network
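
The abstract's central idea, attention over appearance and motion features that is refined once per question word, can be illustrated with a small sketch. This is not the paper's architecture: the actual model is trained end to end, whereas the code below has no learned parameters, and the query and memory update rules, the function name `refine_attention`, and the helper names are all assumptions chosen for illustration only.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def weighted_sum(weights, vectors):
    """Attention-weighted average of a list of equal-length vectors."""
    dim = len(vectors[0])
    return [sum(w * vec[i] for w, vec in zip(weights, vectors)) for i in range(dim)]


def refine_attention(appearance, motion, question_words):
    """Recompute attention over T time steps once per question word.

    appearance, motion: lists of T feature vectors, each of length d.
    question_words: non-empty list of word vectors, each of length d.
    Returns (appearance_weights, motion_weights, memory) after the last word.
    """
    d = len(appearance[0])
    memory = [0.0] * d  # summary of what earlier words attended to
    app_w = mot_w = None
    for word in question_words:
        # Assumed rule: the query mixes the current word with the memory,
        # so each new word adjusts the attention built up so far.
        query = [math.tanh(w + m) for w, m in zip(word, memory)]
        app_w = softmax([dot(f, query) for f in appearance])
        mot_w = softmax([dot(f, query) for f in motion])
        app_ctx = weighted_sum(app_w, appearance)
        mot_ctx = weighted_sum(mot_w, motion)
        # Assumed rule: fold both attended contexts back into the memory.
        memory = [math.tanh(m + 0.5 * (a + b))
                  for m, a, b in zip(memory, app_ctx, mot_ctx)]
    return app_w, mot_w, memory


if __name__ == "__main__":
    # Toy inputs: 3 time steps, 2-dimensional features, 2 question words.
    appearance = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
    motion = [[0.0, 1.0], [1.0, 0.0], [0.2, 0.2]]
    words = [[1.0, 0.0], [0.0, 1.0]]
    app_w, mot_w, memory = refine_attention(appearance, motion, words)
    print("appearance attention:", app_w)
    print("motion attention:", mot_w)
```

In a trained model the query and memory updates would be learned, and the final weighted video representation would feed an answer decoder; the sketch only shows how the attention weights can change as each word is read.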