Dual Attention and Question Categorization based Visual Question Answering

IEEE Transactions on Artificial Intelligence (2022)

Abstract
Visual Question Answering (VQA) aims to predict an answer to a natural language question about an associated image. This work focuses on two important issues in VQA, a complex multimodal AI task: first, predicting an answer within a large output answer space; second, obtaining enriched representations through cross-modality interactions. To address these issues, we propose a Dual Attention (DA) and Question Categorization (QC) based Visual Question Answering model (DAQC-VQA). DAQC-VQA comprises three main network modules: first, a novel dual attention mechanism that produces an enriched cross-domain representation of the two modalities; second, a question classifier subsystem that identifies the category of the input natural language question and thereby reduces the answer search space; third, a subsystem that predicts the answer conditioned on the question category. All component networks of DAQC-VQA are trained end-to-end with a joint loss function. The performance of DAQC-VQA is evaluated on two widely used VQA datasets, viz., TDIUC and VQA2.0. Experimental results demonstrate competitive performance against recent state-of-the-art VQA models, and an ablation analysis indicates that the enriched representation obtained with the proposed dual attention mechanism helps improve performance.
Keywords
Attention networks,classification networks,dual attention,multimodal fusion,visual question answering
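The abstract describes an architecture with a dual (bidirectional) cross-modal attention module, a question-category classifier that narrows the answer space, and a category-conditioned answer head, all trained with a joint loss. The sketch below is a minimal PyTorch illustration of that structure only; the module names, dimensions, attention form, and loss weighting are assumptions, not the paper's actual implementation.

```python
# Minimal sketch of the DAQC-VQA structure described in the abstract.
# All hyperparameters and the exact attention/fusion details are assumed.
import torch
import torch.nn as nn

class DualAttention(nn.Module):
    """Cross-modal attention in both directions (question->image and image->question)."""
    def __init__(self, dim):
        super().__init__()
        self.img_att = nn.MultiheadAttention(dim, num_heads=4, batch_first=True)
        self.txt_att = nn.MultiheadAttention(dim, num_heads=4, batch_first=True)

    def forward(self, img_feats, q_feats):
        # Image regions attended by the question, and question tokens attended by the image.
        img_ctx, _ = self.img_att(q_feats, img_feats, img_feats)
        txt_ctx, _ = self.txt_att(img_feats, q_feats, q_feats)
        # Pool each attended sequence and concatenate into one joint representation.
        return torch.cat([img_ctx.mean(1), txt_ctx.mean(1)], dim=-1)

class DAQCVQA(nn.Module):
    def __init__(self, dim=512, n_categories=12, n_answers=3000):
        super().__init__()
        self.dual_att = DualAttention(dim)
        self.q_classifier = nn.Linear(2 * dim, n_categories)               # question-category head
        self.answer_head = nn.Linear(2 * dim + n_categories, n_answers)    # category-conditioned answer head

    def forward(self, img_feats, q_feats):
        joint = self.dual_att(img_feats, q_feats)
        cat_logits = self.q_classifier(joint)
        # Condition answer prediction on the predicted question category.
        ans_logits = self.answer_head(torch.cat([joint, cat_logits.softmax(-1)], dim=-1))
        return cat_logits, ans_logits

def joint_loss(cat_logits, ans_logits, cat_labels, ans_labels, alpha=0.5):
    """Joint objective over both heads, as stated in the abstract; the weight alpha is an assumption."""
    ce = nn.CrossEntropyLoss()
    return ce(ans_logits, ans_labels) + alpha * ce(cat_logits, cat_labels)
```

For example, with batched region features of shape (B, 36, 512) and question token features of shape (B, 20, 512), the model returns category and answer logits that feed the joint loss above; the real DAQC-VQA fusion and training details should be taken from the paper itself.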