Relation-Aware Image Captioning for Explainable Visual Question Answering

Ching-Shan Tseng, Ying-Jia Lin, Hung-Yu Kao

2022 International Conference on Technologies and Applications of Artificial Intelligence (TAAI)

Abstract
Recent studies leveraging object detection models for Visual Question Answering (VQA) ignore the correlations or interactions between multiple objects. In addition, previous VQA models are black boxes to humans, making it difficult to explain why a model returns a correct or wrong answer. To address these problems, we propose a new model structure that incorporates image captioning into the VQA task. Our model constructs a relation graph from the relative positions between region pairs and then produces relation-aware visual features with a relation encoder. To make the predictions explainable, we introduce an image captioning module and conduct multi-task training. Meanwhile, the generated captions are injected into the predictor to assist cross-modal understanding. Experiments show that our model can generate meaningful answers and explanations according to the questions and images. In addition, the relation encoder and the caption-attended predictor lead to improvements on different types of questions.
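
The abstract does not spell out how the relation graph is built. Below is a minimal Python sketch of one common scheme for labelling region pairs by their relative positions (in the style of ReGAT-like spatial relation encoders); the thresholds, label set, and helper names (iou, spatial_relation, build_relation_graph) are illustrative assumptions, not the authors' implementation.

import math

def iou(a, b):
    # Intersection-over-union of two boxes given as (x1, y1, x2, y2).
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter + 1e-8)

def spatial_relation(a, b):
    # Assign a spatial-relation label to the ordered region pair (a, b):
    # 1 = b inside a, 2 = b covers a, 3 = strong overlap (IoU >= 0.5),
    # 4..11 = one of eight direction bins from the angle between box centers.
    if b[0] >= a[0] and b[1] >= a[1] and b[2] <= a[2] and b[3] <= a[3]:
        return 1
    if a[0] >= b[0] and a[1] >= b[1] and a[2] <= b[2] and a[3] <= b[3]:
        return 2
    if iou(a, b) >= 0.5:
        return 3
    cxa, cya = (a[0] + a[2]) / 2, (a[1] + a[3]) / 2
    cxb, cyb = (b[0] + b[2]) / 2, (b[1] + b[3]) / 2
    angle = math.degrees(math.atan2(cyb - cya, cxb - cxa)) % 360
    return 4 + int(angle // 45)

def build_relation_graph(boxes):
    # Labelled adjacency matrix over all ordered region pairs (0 = self loop).
    n = len(boxes)
    return [[0 if i == j else spatial_relation(boxes[i], boxes[j])
             for j in range(n)] for i in range(n)]

# Example: three detected regions (x1, y1, x2, y2)
boxes = [(10, 10, 100, 100), (30, 30, 60, 60), (150, 20, 220, 90)]
for row in build_relation_graph(boxes):
    print(row)

The resulting edge labels would then condition a relation encoder (e.g., a graph attention layer per relation type) to produce the relation-aware visual features the abstract mentions.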
Keywords
visual question answering,image captioning,explainable VQA,cross-modality learning,multi-task learning