ReGR: Relation-aware graph reasoning framework for video question answering

INFORMATION PROCESSING & MANAGEMENT (2023)

Abstract
Video question answering (VideoQA) is a challenging cross-modal task that aims to understand video content and answer questions about it. Mainstream approaches extract appearance and motion features and use them to characterize videos separately, ignoring the interactions between the two and between each of them and the question. Furthermore, some crucial details of the semantic interactions between visual objects are overlooked. In this paper, we propose a novel Relation-aware Graph Reasoning (ReGR) framework for video question answering, which first combines multiple interaction relations between visual objects, namely appearance-motion and location-semantic relations. For the interaction between appearance and motion, we design the Appearance-Motion Block, which is question-guided and captures the interdependence between appearance and motion. For the interaction between location and semantics, we design the Location-Semantic Block, which uses a Multi-Relation Graph Attention Network that we construct to capture the geometric positions of objects and the semantic interactions between them. Finally, the question-driven Multi-Visual Fusion captures more accurate multimodal representations. Extensive experiments on three benchmark datasets, TGIF-QA, MSVD-QA, and MSRVTT-QA, demonstrate the superiority of the proposed ReGR over state-of-the-art methods.
Keywords
Video question answering, Cross-modal, Graph neural network, Interaction relations reasoning, Attention mechanism