Watch, Listen, and Answer: Open-ended VideoQA with Modulated Multi-stream 3D ConvNets

29th European Signal Processing Conference (EUSIPCO 2021)

Abstract
We propose an open-ended multimodal video question answering (VideoQA) method that predicts textual answers by referring to multimodal information derived from videos. Most current open-ended VideoQA methods focus on motion and appearance features from videos and ignore the audio features that are useful for understanding video content in more detail. A few prior works that use motion, appearance, and audio features showed poor results on public benchmarks because they failed to effectively fuse detailed (e.g., region- or grid-level) multimodal features for video reasoning. We overcome these limitations with multi-stream 3-dimensional convolutional networks (3D ConvNets) and a transformer-based modulator for VideoQA. Our network represents detailed motion and appearance features as well as an audio feature with multiple 3D ConvNets and modulates each intermediate representation with question information to extract its relevant spatiotemporal features over the frames. Based on the question content, our network fuses the multimodal information from the 3D ConvNets and predicts the final answer. Our VideoQA method, which effectively combines multimodal data, outperformed both a previous multimodal VideoQA method and a state-of-the-art method on standard benchmarks. Visualization suggests that our method can predict the correct answers by listening to the audio information, even when the motion and appearance features are inadequate for understanding the video content.
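A minimal sketch of the question-conditioned multi-stream design described in the abstract, assuming cross-attention as a stand-in for the paper's transformer-based modulator; all module names, feature dimensions, and the pooling/fusion choices here are illustrative assumptions, not the authors' actual architecture.

```python
import torch
import torch.nn as nn

class QuestionModulator(nn.Module):
    """Modulates one stream's spatiotemporal features with question information.
    Cross-attention is an assumed mechanism standing in for the paper's
    transformer-based modulator."""
    def __init__(self, feat_dim, q_dim, n_heads=4):
        super().__init__()
        self.q_proj = nn.Linear(q_dim, feat_dim)
        self.attn = nn.MultiheadAttention(feat_dim, n_heads, batch_first=True)
        self.norm = nn.LayerNorm(feat_dim)

    def forward(self, feats, q_emb):
        # feats: (B, N, D) flattened spatiotemporal tokens from one 3D ConvNet stream
        # q_emb: (B, L, Q) question token embeddings
        q = self.q_proj(q_emb)                    # (B, L, D)
        mod, _ = self.attn(feats, q, q)           # attend stream features to the question
        return self.norm(feats + mod)             # residual modulation

class MultiStreamVideoQA(nn.Module):
    """Toy model: motion, appearance, and audio streams are each modulated by
    the question, pooled, fused, and mapped to answer logits (open-ended
    answering treated as classification over a fixed answer vocabulary)."""
    def __init__(self, feat_dim=256, q_dim=300, n_answers=1000):
        super().__init__()
        self.modulators = nn.ModuleList(QuestionModulator(feat_dim, q_dim) for _ in range(3))
        self.fuse = nn.Linear(3 * feat_dim, feat_dim)
        self.classifier = nn.Linear(feat_dim, n_answers)

    def forward(self, motion, appearance, audio, q_emb):
        pooled = []
        for feats, modulator in zip((motion, appearance, audio), self.modulators):
            pooled.append(modulator(feats, q_emb).mean(dim=1))  # pool over tokens
        fused = torch.relu(self.fuse(torch.cat(pooled, dim=-1)))
        return self.classifier(fused)                           # answer logits

# Example with random tensors: 2 videos, 64 tokens per stream, 12 question tokens.
model = MultiStreamVideoQA()
logits = model(torch.randn(2, 64, 256), torch.randn(2, 64, 256),
               torch.randn(2, 64, 256), torch.randn(2, 12, 300))
print(logits.shape)  # torch.Size([2, 1000])
```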
Keywords
Video Question Answering