Multi-interaction Network with Object Relation for Video Question Answering

Proceedings of the 27th ACM International Conference on Multimedia (2019)

Abstract
Video question answering is an important task for testing a machine's ability to understand video. Existing methods typically combine recurrent and convolutional neural networks to capture the spatial and temporal information in a video. Recent work has also shown that attention mechanisms can improve performance. In this paper, we propose a new model, the Multi-interaction network, for video question answering. Our model involves two types of interaction. The first is the multi-modal interaction between visual and textual information. The second is the multi-level interaction inside the multi-modal interaction. Specifically, instead of using the original self-attention, we propose a new attention mechanism called multi-interaction, which captures element-wise and segment-wise sequence interactions simultaneously. In addition to the usual frame-level interaction, we also take object relations into account in order to obtain more fine-grained information, such as motions and other potential relations among the objects. We evaluate our method on TGIF-QA and two other video QA datasets. The qualitative and quantitative experimental results show the effectiveness of our model, which achieves new state-of-the-art performance.
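The abstract does not spell out how element-wise and segment-wise interactions are computed. The sketch below is only an illustration of the general idea, not the paper's formulation: it assumes that element-wise interaction means scaled dot-product attention from each question token over individual frame features, and that segment-wise interaction means the same attention over mean-pooled windows of consecutive frames. All function names, the `segment_size` parameter, and the toy feature vectors are hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(queries, keys):
    """Scaled dot-product attention; the keys also serve as the values."""
    dim = len(keys[0])
    scale = math.sqrt(dim)
    attended = []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, k)) / scale for k in keys]
        weights = softmax(scores)
        attended.append(
            [sum(w * k[d] for w, k in zip(weights, keys)) for d in range(dim)]
        )
    return attended


def mean_pool_segments(sequence, size):
    """Average consecutive windows of `size` vectors (the last may be shorter)."""
    segments = []
    for start in range(0, len(sequence), size):
        window = sequence[start:start + size]
        segments.append([sum(col) / len(window) for col in zip(*window)])
    return segments


def two_granularity_attention(question, frames, segment_size=2):
    """Assumed fusion: concatenate element-level and segment-level context."""
    element_level = attend(question, frames)
    segment_level = attend(question, mean_pool_segments(frames, segment_size))
    return [e + s for e, s in zip(element_level, segment_level)]


# Toy inputs: 2 question-token vectors and 4 frame vectors, all 2-dimensional.
question = [[1.0, 0.0], [0.0, 1.0]]
frames = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
fused = two_granularity_attention(question, frames, segment_size=2)
print(len(fused), len(fused[0]))
```

Each question token ends up with one context vector built from single frames and one built from pooled segments. Concatenation is just one simple way to combine them and is an assumption here.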
Keywords
multi-interaction, object relation, video question answering