Bilaterally Slimmable Transformer for Elastic and Efficient Visual Question Answering

IEEE TRANSACTIONS ON MULTIMEDIA (2023)

Abstract
Recent Transformer architectures (Vaswani et al., 2017) have brought remarkable improvements to visual question answering (VQA). Nevertheless, Transformer-based VQA models are usually deep and wide to guarantee good performance, so they can only run on powerful GPU servers and cannot run on capacity-restricted platforms such as mobile phones. It is therefore desirable to learn an elastic VQA model that supports adaptive pruning at runtime to meet the efficiency constraints of different platforms. To this end, we present the bilaterally slimmable Transformer (BST), a general framework that can be seamlessly integrated into arbitrary Transformer-based VQA models to train a single model once and obtain various slimmed submodels of different widths and depths. To verify the effectiveness and generality of this method, we integrate the proposed BST framework with three typical Transformer-based VQA approaches, namely MCAN (Yu et al., 2019), UNITER (Chen et al., 2020), and CLIP-ViL (Shen et al., 2021), and conduct extensive experiments on two commonly used benchmark datasets. In particular, one slimmed MCAN(BST) submodel achieves comparable accuracy on VQA-v2 while being 0.38x smaller in model size and having 0.27x fewer FLOPs than the reference MCAN model. The smallest MCAN(BST) submodel has only 9 M parameters and 0.16 G FLOPs during inference, making it possible to deploy on a mobile device with less than 60 ms latency.
Keywords
Transformers, Adaptation models, Computational modeling, Visualization, Task analysis, Training, Neural networks, Visual question answering, slimmable network, transformer, multimodal learning, efficient deep learning
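
The abstract describes training a single model once and slicing both its width (hidden size) and depth (number of layers) at inference time. The following PyTorch sketch illustrates that train-once, slim-anywhere mechanism under stated assumptions; it is not the authors' released implementation. The names `SlimmableLinear`, `dim`, and `depth` are hypothetical, the sketch slims only feed-forward blocks for brevity, and the actual BST framework additionally slims the multi-head attention and uses its own training scheme.

```python
# Illustrative sketch of a bilaterally slimmable stack: one set of
# weights from which narrower (smaller dim) and shallower (smaller
# depth) submodels are sliced at inference. Not the authors' code.
import torch
import torch.nn as nn
import torch.nn.functional as F


class SlimmableLinear(nn.Linear):
    """Linear layer that uses only a top-left slice of its full weight."""

    def forward(self, x: torch.Tensor, in_dim: int, out_dim: int) -> torch.Tensor:
        weight = self.weight[:out_dim, :in_dim]
        bias = self.bias[:out_dim] if self.bias is not None else None
        return F.linear(x, weight, bias)


class SlimmableFFNBlock(nn.Module):
    """Residual feed-forward block whose width can be slimmed at runtime."""

    def __init__(self, max_dim: int = 512, expansion: int = 4):
        super().__init__()
        self.fc1 = SlimmableLinear(max_dim, max_dim * expansion)
        self.fc2 = SlimmableLinear(max_dim * expansion, max_dim)
        self.expansion = expansion

    def forward(self, x: torch.Tensor, dim: int) -> torch.Tensor:
        hidden = F.relu(self.fc1(x, dim, dim * self.expansion))
        return x + self.fc2(hidden, dim * self.expansion, dim)


class BilaterallySlimmableStack(nn.Module):
    """Stack of blocks supporting joint width (dim) and depth slimming."""

    def __init__(self, max_dim: int = 512, max_depth: int = 6):
        super().__init__()
        self.blocks = nn.ModuleList(
            SlimmableFFNBlock(max_dim) for _ in range(max_depth)
        )

    def forward(self, x: torch.Tensor, dim: int, depth: int) -> torch.Tensor:
        x = x[..., :dim]                    # width slimming: first dim channels
        for block in self.blocks[:depth]:   # depth slimming: first depth blocks
            x = block(x, dim)
        return x


# Usage: the same weights serve submodels of different (width, depth) configs.
model = BilaterallySlimmableStack(max_dim=512, max_depth=6)
tokens = torch.randn(2, 10, 512)
full = model(tokens, dim=512, depth=6)    # largest submodel
small = model(tokens, dim=256, depth=3)   # slimmed submodel, same weights
```

In this sketch a single parameter tensor backs every submodel: a narrower configuration reads only the top-left slice of each weight matrix, and a shallower one skips the trailing blocks. This is what allows one trained model to be adaptively pruned at runtime to meet the latency budgets of different platforms, as the abstract describes.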