Cascade transformers with dynamic attention for video question answering

Yimin Jiang, Tingfei Yan, Mingze Yao, Huibing Wang, Wenzhe Liu

Computer Vision and Image Understanding (2024)

Abstract
Visual question answering (VQA) has become an active research topic in recent years, motivated by the challenge of correctly answering questions about videos or images. However, most existing VQA models are designed to answer questions about static images and perform poorly in the video question answering (VideoQA) domain. VideoQA must simultaneously account for the correlations between video frames and the dynamic information of multiple objects within a video. We therefore propose a novel model, Cascade Transformers with Dynamic Attention for Video Question Answering (CTDA-QA), which addresses both considerations at once. Specifically, unlike previous recurrent-neural-network approaches, CTDA-QA employs a cascade of multiple transformers to encode videos and reason over complex spatial and temporal information. In addition, to effectively capture dynamic information across varied scenarios in videos, a flexible attention module is proposed to explore the essential relations between objects along a dynamic timeline. Finally, to avoid spurious answers and fully exploit cross-modal relationships, a mixed-supervised learning strategy is designed to optimize the reasoning tasks. Experiments on several benchmark VideoQA datasets, including comparisons against state-of-the-art methods, verify the performance and effectiveness of CTDA-QA. The accompanying ablation study and visualization results further reveal its potential.
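The cascade idea described in the abstract, stacking transformer-style attention stages so that later stages re-attend over earlier encodings of the frame sequence, can be illustrated with a minimal sketch. This is not the authors' implementation: the function names (`attention`, `cascade_encode`), the number of stages, and the use of plain self-attention without learned projections are all illustrative assumptions.

```python
import numpy as np

def softmax(x, axis=-1):
    # numerically stable softmax
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention(q, k, v):
    # scaled dot-product attention over the frame (time) axis
    d = q.shape[-1]
    weights = softmax(q @ k.T / np.sqrt(d))
    return weights @ v, weights

def cascade_encode(frames, n_stages=2):
    # hypothetical cascade: each stage self-attends over the
    # output of the previous stage (learned projections omitted)
    x = frames
    for _ in range(n_stages):
        x, _ = attention(x, x, x)
    return x

rng = np.random.default_rng(0)
frames = rng.standard_normal((16, 64))  # 16 frames, 64-d features
encoded = cascade_encode(frames)
print(encoded.shape)  # (16, 64)
```

In the full model, each stage would carry learned query/key/value projections and the "dynamic attention" module would additionally condition these weights on object-level features and the question; the sketch only shows the cascaded-attention data flow.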
Keywords
Video question answering, Cascade transformers, Dynamic attention