MADTP: Multimodal Alignment-Guided Dynamic Token Pruning for Accelerating Vision-Language Transformer
CVPR 2024
Abstract
Vision-Language Transformers (VLTs) have shown great success recently, but
they incur heavy computation costs, largely because of the large number of
visual and language tokens. Existing token
pruning research for compressing VLTs mainly follows a single-modality-based
scheme yet ignores the critical role of aligning different modalities for
guiding the token pruning process, causing the important tokens for one
modality to be falsely pruned in another modality branch. Meanwhile, existing
VLT pruning works also lack the flexibility to dynamically compress each layer
based on different input samples. To this end, we propose a novel framework
named Multimodal Alignment-Guided Dynamic Token Pruning (MADTP) for
accelerating various VLTs. Specifically, we first introduce a well-designed
Multi-modality Alignment Guidance (MAG) module that can align features of the
same semantic concept from different modalities, to ensure the pruned tokens
are less important for all modalities. We further design a novel Dynamic Token
Pruning (DTP) module, which can adaptively adjust the token compression ratio
in each layer based on different input instances. Extensive experiments on
various benchmarks demonstrate that MADTP significantly reduces the
computational complexity of a range of multimodal models while preserving
competitive performance. Notably, when applied to the BLIP model on the NLVR2
dataset, MADTP can reduce the GFLOPs by 80% with less than 4% performance
degradation.
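
To make the idea of an input-dependent compression ratio concrete, the following is a minimal sketch and not the paper's DTP module. It assumes each token already carries an importance score (hypothetically, one derived from cross-modal alignment) and that a layer keeps every token whose score reaches a threshold. The function name `dynamic_prune`, the `threshold` and `min_keep` parameters, and the example scores are all illustrative assumptions.

```python
from typing import List, Sequence


def dynamic_prune(
    scores: Sequence[float], threshold: float, min_keep: int = 1
) -> List[int]:
    """Return the indices of tokens to keep, in their original order.

    scores:    per-token importance values (assumed to be given; how they
               are computed is outside the scope of this sketch).
    threshold: tokens scoring below this value are pruned.
    min_keep:  lower bound on the number of surviving tokens.
    """
    keep = [i for i, s in enumerate(scores) if s >= threshold]
    if len(keep) < min_keep:
        # Fall back to the top-scoring tokens so the next layer never
        # receives an empty sequence.
        ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
        keep = sorted(ranked[:min_keep])
    return keep


# Same threshold, different inputs: the number of kept tokens differs,
# so the effective compression ratio adapts to each sample.
peaked_scores = [0.9, 0.1, 0.05, 0.8, 0.02]
flat_scores = [0.6, 0.7, 0.5, 0.65, 0.55]
print(dynamic_prune(peaked_scores, 0.5))  # [0, 3]
print(dynamic_prune(flat_scores, 0.5))    # [0, 1, 2, 3, 4]
```

With a fixed keep ratio, both inputs would lose the same number of tokens. A threshold rule like this one instead prunes aggressively when importance is concentrated in a few tokens and conservatively when it is spread out.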