Question-Aware Tube-Switch Network for Video Question Answering

Proceedings of the 27th ACM International Conference on Multimedia (2019)

Abstract
Video Question Answering (VideoQA), the task of answering questions about videos, involves rich spatio-temporal content (e.g., appearance and motion) and requires a multi-hop reasoning process. However, existing methods usually handle appearance and motion separately and fail to synchronize the attention over appearance and motion features, neglecting two key properties of VideoQA: (1) appearance and motion features are usually concomitant and complementary at the time-slice level, and some questions rely on joint representations of both kinds of features at a particular point in the video; (2) appearance and motion differ in importance across the steps of multi-step reasoning. In this paper, we propose a novel Question-Aware Tube-Switch Network (TSN) for video question answering, which contains (1) a Mix module that synchronously combines the appearance and motion representations at the time-slice level, achieving fine-grained temporal alignment and correspondence between appearance and motion at every time slice, and (2) a Switch module that adaptively chooses the appearance or motion tube as primary at each reasoning step, guiding the multi-hop reasoning process. To train TSN end to end, we use the Gumbel-Softmax strategy to account for the discrete tube-switch process. Extensive experimental results on two benchmarks, MSVD-QA and MSRVTT-QA, demonstrate that the proposed TSN consistently outperforms state-of-the-art methods on all metrics.
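The abstract names Gumbel-Softmax as the way to handle the discrete tube-switch choice but gives no implementation details. The following is a minimal sketch of the general Gumbel-Softmax sampling step applied to a two-way choice, not the authors' code. The function names `gumbel_softmax` and `switch_tube`, the temperature `tau`, and the scalar scores are all hypothetical, and plain Python is used only to show the forward sampling; an actual model would compute this inside an autodiff framework so that gradients flow through the relaxed probabilities.

```python
import math
import random


def gumbel_softmax(logits, tau=1.0, rng=random):
    """Draw a relaxed one-hot sample from unnormalized logits.

    Generic Gumbel-Softmax sampling; `tau` is the temperature
    (lower values push the sample closer to one-hot).
    """
    noisy = []
    for logit in logits:
        # Clamp the uniform draw away from 0 and 1 so both logs are finite.
        u = min(max(rng.random(), 1e-12), 1.0 - 1e-12)
        gumbel = -math.log(-math.log(u))  # Gumbel(0, 1) noise
        noisy.append((logit + gumbel) / tau)
    # Numerically stable softmax over the perturbed logits.
    peak = max(noisy)
    exps = [math.exp(v - peak) for v in noisy]
    total = sum(exps)
    return [e / total for e in exps]


def switch_tube(appearance_score, motion_score, tau=0.5, rng=random):
    """Hypothetical helper: pick the primary tube for one reasoning step.

    The two scores stand in for whatever question-aware scores a model
    would produce; they are assumptions, not quantities from the paper.
    """
    probs = gumbel_softmax([appearance_score, motion_score], tau, rng)
    primary = "appearance" if probs[0] >= probs[1] else "motion"
    return primary, probs


if __name__ == "__main__":
    rng = random.Random(0)
    for step in range(3):
        primary, probs = switch_tube(1.2, 0.4, tau=0.5, rng=rng)
        print(step, primary, [round(p, 3) for p in probs])
```

The hard choice (`primary`) is what a discrete switch would use in the forward pass, while the soft probabilities are what make the selection differentiable during training.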
Keywords
appearance and motion, video question answering, visual attention