Fusing Attention With Visual Question Answering

2017 International Joint Conference on Neural Networks (IJCNN), 2017

Abstract
Visual Question Answering (VQA) is a complex problem that combines natural language processing and image processing to answer a question using information from an image. The basic architecture uses a CNN to extract features from the image and an RNN to process the question, then combines the two in an MLP to produce an answer. Such architectures perform well at identifying content but fail at higher-level reasoning, such as spatial awareness and combining objects. To help remedy this, we propose using attention to divide the image into separate objects, then using each object's extracted features, together with its location and size, to train the MLP.
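The fusion step described in the abstract can be illustrated with a minimal sketch. The abstract does not specify the attention mechanism, the feature dimensions, or how the inputs are combined, so everything below is an assumption: the function names (`box_geometry`, `fuse`, `mlp_forward`) are hypothetical, plain concatenation stands in for the fusion, random vectors stand in for the CNN and RNN outputs, and the MLP weights are untrained.

```python
import math
import random


def box_geometry(box, image_w, image_h):
    """Return normalized [center_x, center_y, width, height] for a pixel box."""
    x0, y0, x1, y1 = box
    return [
        (x0 + x1) / 2 / image_w,
        (y0 + y1) / 2 / image_h,
        (x1 - x0) / image_w,
        (y1 - y0) / image_h,
    ]


def fuse(question_vec, regions, image_w, image_h):
    """Concatenate the question vector with each region's features and geometry.

    Concatenation is an assumed fusion choice, not the paper's stated method.
    """
    fused = list(question_vec)
    for feats, box in regions:
        fused.extend(feats)
        fused.extend(box_geometry(box, image_w, image_h))
    return fused


def mlp_forward(x, w1, b1, w2, b2):
    """One hidden ReLU layer followed by a softmax over candidate answers."""
    hidden = [
        max(0.0, sum(w * xi for w, xi in zip(row, x)) + b)
        for row, b in zip(w1, b1)
    ]
    logits = [
        sum(w * h for w, h in zip(row, hidden)) + b
        for row, b in zip(w2, b2)
    ]
    top = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(v - top) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


rng = random.Random(0)

# Stand-in for an RNN encoding of the question (hypothetical size 4).
question_vec = [rng.uniform(-1, 1) for _ in range(4)]

# Stand-ins for attended objects: (CNN features, pixel bounding box).
regions = [
    ([rng.uniform(-1, 1) for _ in range(3)], (10, 20, 30, 60)),
    ([rng.uniform(-1, 1) for _ in range(3)], (50, 40, 90, 120)),
]

fused = fuse(question_vec, regions, image_w=100, image_h=200)

# Untrained random weights, only to show the shapes involved.
hidden_dim, num_answers = 8, 5
w1 = [[rng.uniform(-0.5, 0.5) for _ in fused] for _ in range(hidden_dim)]
b1 = [0.0] * hidden_dim
w2 = [[rng.uniform(-0.5, 0.5) for _ in range(hidden_dim)] for _ in range(num_answers)]
b2 = [0.0] * num_answers

answer_probs = mlp_forward(fused, w1, b1, w2, b2)
print(len(fused), answer_probs)
```

Normalizing the box coordinates by the image dimensions keeps the location and size inputs on a scale comparable to the feature values, which is one plausible way to let the MLP use spatial information; the paper may encode geometry differently.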
Keywords
Question Answering, Saliency, Deep Learning