Visual Question Answering With Dense Inter- and Intra-Modality Interactions

IEEE TRANSACTIONS ON MULTIMEDIA (2021)

Abstract
Learning effective interactions between multi-modal features is at the heart of visual question answering (VQA). A common defect of existing VQA approaches is that they consider only a very limited number of inter-modality interactions, which may not be enough to model the latent, complex image-question relations necessary for accurately answering questions. Moreover, most methods neglect intra-modality interactions, which are also important to VQA. In this work, we propose a novel DenIII framework for modeling dense inter- and intra-modality interactions. It densely connects all pairwise layers of the network via the proposed Inter- and Intra-modality Attention Connectors, capturing fine-grained interplay across all hierarchical levels. The Inter-modality Attention Connector efficiently connects the multi-modality features at any two layers with bidirectional attention, capturing inter-modality interactions, while the Intra-modality Attention Connector connects features of the same modality with unidirectional attention, modeling intra-modality interactions. Extensive ablation studies and visualizations validate the effectiveness of our method, and DenIII achieves state-of-the-art or competitive performance on three publicly available datasets.
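To make the two connector types concrete, the following is a minimal sketch of bidirectional cross-modal attention and unidirectional (self-) attention using plain scaled dot-product attention. This is an illustrative assumption, not the paper's actual implementation: the function names (`inter_modality_connector`, `intra_modality_connector`, `cross_attend`), the residual additions, and the feature shapes are all hypothetical choices for exposition.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_attend(queries, keys_values):
    # Scaled dot-product attention: each query row attends over keys_values rows.
    d = queries.shape[-1]
    scores = queries @ keys_values.T / np.sqrt(d)
    return softmax(scores, axis=-1) @ keys_values

def inter_modality_connector(img_feats, txt_feats):
    # Bidirectional attention (illustrative): image features attend to text
    # features and vice versa; each modality is updated with a residual add.
    img_updated = img_feats + cross_attend(img_feats, txt_feats)
    txt_updated = txt_feats + cross_attend(txt_feats, img_feats)
    return img_updated, txt_updated

def intra_modality_connector(feats):
    # Unidirectional attention within a single modality (self-attention),
    # again with a residual connection (an assumption for this sketch).
    return feats + cross_attend(feats, feats)
```

In the dense scheme the abstract describes, connectors like these would link every pair of layers across the two streams, so each layer receives attended features from all earlier layers of both modalities rather than from its immediate predecessor only.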
Keywords
Visualization, Knowledge discovery, Connectors, Encoding, Task analysis, Image coding, Stacking, Visual question answering, attention, dense interactions