An Interpretable Multimodal Visual Question Answering System using Attention-based Weighted Contextual Features

AAMAS '20: International Conference on Autonomous Agents and Multiagent Systems, Auckland, New Zealand, May 2020

Abstract
Visual question answering (VQA) is a challenging task that requires a deep understanding of both language and images. Most current VQA algorithms focus on finding correlations between basic question embeddings and image features, using an element-wise product or bilinear pooling of the two vectors; some also use attention models to extract features. In this extended abstract, a novel interpretable multimodal system using attention-based weighted contextual features (MA-WCF) is proposed for VQA tasks. This multimodal system assigns adaptive weights both to the question and image features themselves and to their contextual features, according to their importance. Our new model yields state-of-the-art results on the MS COCO VQA dataset for open-ended question tasks.