Long-Term Video Question Answering via Multimodal Hierarchical Memory Attentive Networks

IEEE Transactions on Circuits and Systems for Video Technology (2021)

Abstract
Long-term video question answering plays an essential role in visual information retrieval; it aims to generate natural-language answers to arbitrary free-form questions about a referenced long-term video. Rather than remembering a video as a flat sequence of visual content, humans have an innate cognitive ability to identify the critical moments related to a question at first glance, and then tie together the specific evidence around those moments for further analysis and reasoning. Motivated by this intuition, we propose multimodal hierarchical memory attentive networks with two heterogeneous memory subnetworks: a top guided memory network and a bottom enhanced multimodal memory attentive network. The top guided memory network serves as a shallow inference engine that selects question-relevant, informative moments and extracts salient video content at a coarse-grained level. The bottom enhanced multimodal memory attentive network is then designed as an in-depth reasoning engine that performs more accurate attention, using cues from bottom-level video evidence at a fine-grained level, to improve answer quality. We evaluate the proposed method on three publicly available video question answering benchmarks: ActivityNet-QA, MSRVTT-QA, and MSVD-QA. Experimental results demonstrate that the proposed approach significantly outperforms state-of-the-art methods on long-term videos, and extensive ablation studies explore the sources of the model's effectiveness.
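To make the coarse-to-fine idea concrete, the sketch below shows a minimal question-guided, two-stage attention pipeline in PyTorch: a coarse stage scores clip-level features against the question and keeps the top-k clips, and a fine stage attends over the frame-level features of only those clips before answer classification. This is an illustrative reconstruction under stated assumptions, not the authors' exact model; all module names, dimensions, the top-k selection, and the fusion step are hypothetical.

```python
# Minimal coarse-to-fine attention sketch (assumed structure, not the paper's
# exact equations). Shapes: clip_feats (B, C, d), frame_feats (B, C, F, d),
# question embedding q (B, d).
import torch
import torch.nn as nn
import torch.nn.functional as F

class CoarseToFineQA(nn.Module):
    def __init__(self, d=256, num_answers=1000, top_k=4):
        super().__init__()
        self.top_k = top_k
        # Coarse stage: score clip-level features against the question.
        self.coarse_score = nn.Linear(2 * d, 1)
        # Fine stage: attend over frame features inside the selected clips.
        self.fine_score = nn.Linear(2 * d, 1)
        self.classifier = nn.Linear(2 * d, num_answers)

    def attend(self, feats, q, scorer):
        # feats: (B, N, d); q: (B, d) -> attention-weighted sum, (B, d)
        q_exp = q.unsqueeze(1).expand(-1, feats.size(1), -1)
        logits = scorer(torch.cat([feats, q_exp], dim=-1)).squeeze(-1)
        w = F.softmax(logits, dim=-1)
        return (w.unsqueeze(-1) * feats).sum(dim=1), logits

    def forward(self, clip_feats, frame_feats, q):
        B, C, Fr, d = frame_feats.shape
        # 1) Coarse stage: question-guided attention over clips, then keep
        #    the top-k highest-scoring (most question-relevant) clips.
        coarse_ctx, clip_logits = self.attend(clip_feats, q, self.coarse_score)
        top_idx = clip_logits.topk(self.top_k, dim=-1).indices        # (B, k)
        idx = top_idx[..., None, None].expand(-1, -1, Fr, d)
        selected = frame_feats.gather(1, idx).reshape(B, self.top_k * Fr, d)
        # 2) Fine stage: attention over frames of the selected clips only.
        fine_ctx, _ = self.attend(selected, q, self.fine_score)
        # 3) Fuse coarse and fine evidence and predict an answer class.
        return self.classifier(torch.cat([coarse_ctx, fine_ctx], dim=-1))
```

The design point the sketch illustrates is that the fine-grained stage never touches frames outside the coarsely selected moments, which is what keeps in-depth reasoning tractable on long videos.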
Keywords
Long-term, video question answering, multimodal, hierarchical, memory network, shallow inference, coarse-grained, fine-grained, in-depth reasoning