Generative Attention Model with Adversarial Self-learning for Visual Question Answering

MM '17: ACM Multimedia Conference, Mountain View, California, USA, October 2017

Cited by 20 | Views 10
Abstract
Visual question answering (VQA) is arguably one of the most challenging multimodal understanding problems, as it requires reasoning and deep understanding of the image, the question, and their semantic relationship. Existing VQA methods rely heavily on attention mechanisms to semantically relate question words with image contents in order to answer the related questions. However, most attention models are simplified to a linear transformation over the multimodal representation, which we argue is insufficient for capturing the complex nature of multimodal data. In this paper, we propose a novel generative attention model obtained by adversarial self-learning. The proposed adversarial attention produces more diverse visual attention maps and generalizes better to new questions. Experiments show that the proposed adversarial attention leads to a state-of-the-art VQA model on the two VQA benchmark datasets, VQA v1.0 and v2.0.
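To make the contrast concrete, below is a minimal PyTorch sketch of the two ideas named in the abstract: a baseline attention computed as a single linear map over the fused multimodal representation, and a generative attention map produced from the fused features plus noise, trained adversarially against a discriminator. All module names, dimensions, fusion choices, and the loss wiring are illustrative assumptions, not the paper's actual architecture.

```python
# Hedged sketch: linear attention baseline vs. adversarially trained
# generative attention. Sizes and wiring are assumptions for illustration.
import torch
import torch.nn as nn

D, R = 512, 196  # hypothetical feature dim and number of image regions (14x14)

class LinearAttention(nn.Module):
    """Baseline: attention weights from one linear map over the fused
    multimodal representation, the simplification criticized in the abstract."""
    def __init__(self, dim=D):
        super().__init__()
        self.proj = nn.Linear(dim, 1)

    def forward(self, img_feats, q_feat):
        # img_feats: (B, R, D), q_feat: (B, D)
        fused = img_feats * q_feat.unsqueeze(1)           # elementwise fusion (assumed)
        attn = torch.softmax(self.proj(fused).squeeze(-1), dim=1)  # (B, R)
        return (attn.unsqueeze(-1) * img_feats).sum(1), attn

class AttentionGenerator(nn.Module):
    """Generator: maps fused features plus noise to an attention map,
    so repeated draws yield diverse maps."""
    def __init__(self, dim=D, noise_dim=64):
        super().__init__()
        self.noise_dim = noise_dim
        self.net = nn.Sequential(
            nn.Linear(dim + noise_dim, dim), nn.ReLU(),
            nn.Linear(dim, 1),
        )

    def forward(self, img_feats, q_feat):
        B, R_, _ = img_feats.shape
        z = torch.randn(B, R_, self.noise_dim, device=img_feats.device)
        fused = img_feats * q_feat.unsqueeze(1)
        logits = self.net(torch.cat([fused, z], dim=-1)).squeeze(-1)
        return torch.softmax(logits, dim=1)               # (B, R) attention map

class AttentionDiscriminator(nn.Module):
    """Discriminator: scores whether an attention map looks real given the question."""
    def __init__(self, dim=D, regions=R):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(regions + dim, 256), nn.ReLU(),
            nn.Linear(256, 1),
        )

    def forward(self, attn, q_feat):
        return self.net(torch.cat([attn, q_feat], dim=-1))  # (B, 1) logit

# One adversarial step (standard non-saturating GAN loss); 'real_attn' stands
# in for a reference attention source, e.g. a pretrained attention module.
G, Dsc = AttentionGenerator(), AttentionDiscriminator()
bce = nn.BCEWithLogitsLoss()
img_feats, q_feat = torch.randn(4, R, D), torch.randn(4, D)
real_attn = torch.softmax(torch.randn(4, R), dim=1)

fake_attn = G(img_feats, q_feat)
d_loss = bce(Dsc(real_attn, q_feat), torch.ones(4, 1)) + \
         bce(Dsc(fake_attn.detach(), q_feat), torch.zeros(4, 1))
g_loss = bce(Dsc(fake_attn, q_feat), torch.ones(4, 1))
```

The noise input is what makes the attention generative rather than deterministic: sampling different z for the same question-image pair yields the diverse attention maps the abstract describes, while the adversarial signal pushes those maps toward plausible ones.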
Keywords
Visual Question Answering, Multimodal Representation, Adversarial Learning