Improving Visual Question Answering by Leveraging Depth and Adapting Explainability

2022 31st IEEE International Conference on Robot and Human Interactive Communication (RO-MAN)(2022)

Cited 0|Views18
No score
During human-robot conversation, it is critical for robots to be able to answer users’ questions accurately and provide a suitable explanation for why they arrive at the answer they provide. Depth is a crucial component in producing more intelligent robots that can respond correctly as some questions might rely on spatial relations within the scene, for which 2D RGB data alone would be insufficient. Due to the lack of existing depth datasets for the task of VQA, we introduce a new dataset, VQA-SUNRGBD. When we compare our proposed model on this RGB-D dataset against the baseline VQN network on RGB data alone, we show that ours outperforms, particularly in questions relating to depth such as asking about the proximity of objects and relative positions of objects to one another. We also provide Grad-CAM activations to gain insight regarding the predictions on depth-related questions and find that our method produces better visual explanations compared to Grad-CAM on RGB data. To our knowledge, this work is the first of its kind to leverage depth and an explainability module to produce an explainable Visual Question Answering (VQA) system.
Translated text
Key words
Visual Question Answering,Leveraging Depth,Explainability
AI Read Science
Must-Reading Tree
Generate MRT to find the research sequence of this paper
Chat Paper
Summary is being generated by the instructions you defined