Efficient Vision-and-Language Pre-training with Text-Relevant Image Patch Selection
arXiv (2024)
Abstract
Vision Transformers (ViTs) have become increasingly popular in large-scale
Vision and Language Pre-training (VLP) models. Although previous VLP research
has demonstrated the efficacy of ViTs, these efforts still struggle with
computational inefficiencies caused by lengthy visual sequences. To address
this challenge, we introduce an efficient VLP approach called TRIPS, which
stands for Text-Relevant Image Patch Selection. TRIPS progressively reduces the
visual sequence using a text-guided patch-selection layer in the visual
backbone, thereby accelerating both training and inference processes. This
patch-selection layer dynamically computes text-dependent visual attention,
enabling it to identify attentive image tokens with text guidance and fuse
inattentive ones in an end-to-end fashion. Importantly, TRIPS does not add any
extra parameters and generalizes to most ViT-based VLP models. We incorporate
TRIPS into three representative VLP models covering single-stream, dual-stream,
and generative paradigms, and conduct extensive experiments on five widely-used
multi-modal benchmark datasets. Our experimental results reveal that TRIPS delivers roughly a 40% speedup while maintaining competitive or superior performance on downstream tasks.
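
To make the select-and-fuse idea concrete, here is a minimal sketch of a text-guided patch-selection step, assuming ViT patch embeddings and a pooled text query vector. The function name, the `keep_ratio` parameter, and the attention-weighted fusion of discarded patches are illustrative assumptions, not the paper's exact layer, which sits inside the visual backbone and is trained end-to-end.

```python
import torch

def text_guided_patch_selection(patch_tokens, text_token, keep_ratio=0.5):
    """Illustrative sketch (not the paper's exact layer).

    patch_tokens: (B, N, D) image patch embeddings from a ViT layer
    text_token:   (B, D)    pooled text representation used as a query
    keep_ratio:   fraction of patches kept as "attentive" tokens

    Returns a shortened sequence: the top-k text-attended patches plus
    one fused token aggregating the remaining (inattentive) patches.
    """
    B, N, D = patch_tokens.shape
    k = max(1, int(N * keep_ratio))

    # Text-dependent attention: scaled similarity of each patch to the text query.
    scores = torch.einsum("bnd,bd->bn", patch_tokens, text_token) / D ** 0.5
    attn = scores.softmax(dim=-1)  # (B, N)

    # Keep the k most text-relevant patches, preserving their original order.
    topk = attn.topk(k, dim=-1).indices.sort(dim=-1).values  # (B, k)
    kept = patch_tokens.gather(1, topk.unsqueeze(-1).expand(-1, -1, D))

    # Fuse the inattentive patches into a single token, weighted by their
    # renormalized attention, so their information is merged rather than dropped.
    keep_mask = torch.zeros_like(attn, dtype=torch.bool).scatter_(1, topk, True)
    rest_attn = attn.masked_fill(keep_mask, 0.0)
    rest_attn = rest_attn / rest_attn.sum(dim=-1, keepdim=True).clamp_min(1e-6)
    fused = torch.einsum("bn,bnd->bd", rest_attn, patch_tokens).unsqueeze(1)

    return torch.cat([kept, fused], dim=1)  # (B, k + 1, D)
```

Because the output sequence has k + 1 tokens instead of N, every subsequent attention layer runs over a shorter visual sequence, which is where the training and inference speedup comes from; applying such a layer at several depths reduces the sequence progressively, and since the step is just attention plus weighted pooling, it adds no new parameters.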