Structured Triplet Learning with POS-tag Guided Attention for Visual Question Answering

2018 IEEE Winter Conference on Applications of Computer Vision (WACV 2018)

Cited 33 | Views 34
Abstract
Visual question answering (VQA) is of significant interest due to its potential to be a strong test of image understanding systems and to probe the connection between language and vision. Despite much recent progress, general VQA is far from a solved problem. In this paper, we focus on the VQA multiple-choice task, and provide some good practices for designing an effective VQA model that can capture language-vision interactions and perform joint reasoning. We explore mechanisms of incorporating part-of-speech (POS) tag guided attention, convolutional n-grams, triplet attention interactions between the image, question and candidate answer, and structured learning for triplets based on image-question pairs. We evaluate our models on two popular datasets: Visual7W and VQA Real Multiple Choice. Our final model achieves the state-of-the-art performance of 68.2% on Visual7W, and a very competitive performance of 69.6% on the test-standard split of VQA Real Multiple Choice.
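
The abstract names POS-tag guided attention as one of its key mechanisms. As a concrete illustration, here is a minimal sketch of one plausible form of that idea, written in PyTorch. This is not the authors' implementation: the module name POSTagGuidedAttention, the tag-embedding design, and all dimensions (300-d word features, 45 Penn Treebank tags, 32-d tag embeddings) are assumptions for illustration only.

# A minimal sketch (not the paper's released code) of POS-tag guided
# attention over question words: per-word POS-tag embeddings contribute
# to attention logits that reweight word features before pooling.
import torch
import torch.nn as nn

class POSTagGuidedAttention(nn.Module):
    def __init__(self, word_dim=300, num_pos_tags=45, tag_dim=32):
        super().__init__()
        # Embed each word's POS tag (e.g., the 45 Penn Treebank tags).
        self.tag_embed = nn.Embedding(num_pos_tags, tag_dim)
        # Score each word from its word feature plus tag embedding.
        self.score = nn.Linear(word_dim + tag_dim, 1)

    def forward(self, word_feats, pos_tags):
        # word_feats: (batch, seq_len, word_dim) question word embeddings
        # pos_tags:   (batch, seq_len) integer POS-tag ids
        tags = self.tag_embed(pos_tags)                              # (B, T, tag_dim)
        logits = self.score(torch.cat([word_feats, tags], dim=-1))  # (B, T, 1)
        attn = torch.softmax(logits, dim=1)                          # attention over words
        # Attention-weighted sum yields a single question vector.
        return (attn * word_feats).sum(dim=1)                        # (B, word_dim)

if __name__ == "__main__":
    model = POSTagGuidedAttention()
    words = torch.randn(2, 10, 300)          # toy batch: 2 questions, 10 words each
    tags = torch.randint(0, 45, (2, 10))     # random POS-tag ids
    print(model(words, tags).shape)          # torch.Size([2, 300])

The intuition this sketch captures is that the tag embeddings bias the attention logits, so syntactically salient words (e.g., nouns and verbs) can receive higher weight when pooling the question representation; how the paper actually combines this with the convolutional n-grams and triplet attention is described in the full text.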
Keywords
structured triplet learning, visual question answering, image understanding systems, VQA multiple-choice task, language-vision interactions, convolutional n-grams, triplet attention interactions, image-question pairs, VQA Real Multiple Choice, VQA model, POS-tag guided attention, part-of-speech tag guided attention, Visual7W