BTDP: Toward Sparse Fusion with Block Term Decomposition Pooling for Visual Question Answering

ACM Transactions on Multimedia Computing, Communications, and Applications(2019)

引用 4|浏览43
暂无评分
摘要
AbstractBilinear models are very powerful in multimodal fusion tasks like Visual Question Answering. The predominant bilinear methods can all be seen as a kind of tensor-based decomposition operation that contains a key kernel called “core tensor.” Current approaches usually focus on reducing the computation complexity by applying low-rank constraint on the core tensor. In this article, we propose a novel bilinear architecture called Block Term Decomposition Pooling (BTDP), which not only maintains the advantages of previous bilinear methods but also conducts sparse bilinear interactions between modalities. Our method is based on Block Term Decompositions theory of tensor, which will result in a sparse and learnable block-diagonal core tensor for multimodal fusion. We prove that using such a block-diagonal core tensor is equivalent to conducting many “tiny” bilinear operations in different feature spaces. Thus, introducing sparsity into the bilinear operation can significantly increase the performance of feature fusion and improve VQA models. What is more, our BTDP is very flexible in design. We develop several variants of BTDP and discuss the effects of the diagonal blocks of core tensor. Extensive experiments on two challenging VQA-v1 and VQA-v2 datasets show that our BTDP method outperforms current bilinear models, achieving state-of-the-art performance.
更多
查看译文
关键词
Sparse bilinear pooling,block term decomposition pooling,visual question answering
AI 理解论文
溯源树
样例
生成溯源树,研究论文发展脉络
Chat Paper
正在生成论文摘要