Visual7W: Grounded Question Answering in Images

2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR)

Cited 935 | Views 233
Abstract
We have seen great progress in basic perceptual tasks such as object recognition and detection. However, AI models still fall short of humans in high-level vision tasks because they lack the capacity for deeper reasoning. The recently proposed task of visual question answering (QA) evaluates a model's capacity for deep image understanding. Previous work established only a loose, global association between QA sentences and images, yet in practice many questions and answers relate to local regions of an image. We establish a semantic link between textual descriptions and image regions through object-level grounding, which enables a new type of QA with visual answers, in addition to the textual answers used in previous work. We study visual QA tasks in this grounded setting with a large collection of 7W multiple-choice QA pairs. We also evaluate human performance and several baseline models on these tasks. Finally, we propose a novel LSTM model with spatial attention to tackle the 7W QA tasks.
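
The abstract describes the model only at a high level. As a rough illustration of what an LSTM with spatial attention scoring multiple-choice answers could look like, below is a minimal PyTorch sketch. It is not the authors' implementation; the module names, feature shapes, and the dot-product attention and scoring rules are all assumptions made for illustration.

# A minimal sketch (NOT the paper's released code) of an attention-LSTM
# multiple-choice scorer: an LSTM summarizes the question, the summary
# attends over convolutional image regions, and each candidate answer is
# scored against the fused question+image representation. All names and
# dimensions here are illustrative assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

class AttentionLSTMScorer(nn.Module):
    def __init__(self, vocab_size, embed_dim=300, hidden_dim=512, region_dim=512):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.att_proj = nn.Linear(region_dim, hidden_dim)  # map image regions into LSTM space
        self.ans_proj = nn.Linear(embed_dim, hidden_dim)   # map answer embeddings into LSTM space

    def forward(self, question, regions, answers):
        # question: (B, Tq) token ids; regions: (B, R, region_dim) CNN region features
        # answers:  (B, C, Ta) token ids for C multiple-choice candidates
        q_emb = self.embed(question)                       # (B, Tq, E)
        _, (h, _) = self.lstm(q_emb)
        q_state = h[-1]                                    # (B, H) question summary

        # Spatial attention: weight image regions by relevance to the question.
        keys = self.att_proj(regions)                      # (B, R, H)
        scores = (keys * q_state.unsqueeze(1)).sum(-1)     # (B, R) dot-product scores
        att = F.softmax(scores, dim=1)                     # attention weights over regions
        context = (att.unsqueeze(-1) * keys).sum(1)        # (B, H) attended image context

        # Score each candidate answer against question + attended context.
        a_emb = self.ans_proj(self.embed(answers).mean(2)) # (B, C, H) bag-of-words answers
        fused = q_state + context                          # (B, H)
        return (a_emb * fused.unsqueeze(1)).sum(-1)        # (B, C) candidate logits

Training such a scorer would typically minimize cross-entropy over the index of the correct candidate (Visual7W provides four candidates per question).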
Keywords
LSTM model,7W multiple-choice QA pairs,object-level grounding,image regions,textual descriptions,deep image understanding,visual question answering,AI models,object detection,object recognition,grounded question answering,Visual7W