SyCoCa: Symmetrizing Contrastive Captioners with Attentive Masking for Multimodal Alignment
CoRR (2024)
Abstract
Multimodal alignment between language and vision is a fundamental topic in current vision-language model research. Contrastive Captioners (CoCa), a representative method, integrates Contrastive Language-Image Pretraining (CLIP) and Image Captioning (IC) into a unified framework, achieving impressive results. CLIP imposes bidirectional constraints on the global representations of entire images and sentences. Although IC performs unidirectional image-to-text generation on local representations, it lacks any constraint on local text-to-image reconstruction, which limits the ability to understand images at a fine-grained level when aligning them with texts. To achieve multimodal alignment from both global and local perspectives, this paper proposes Symmetrizing Contrastive Captioners (SyCoCa), which introduces bidirectional interactions between images and texts at both the global and local representation levels. Specifically, we add a Text-Guided Masked Image Modeling (TG-MIM) head on top of the image-text contrastive (ITC) and IC heads. The improved SyCoCa can further leverage textual cues to reconstruct contextual images and visual cues to predict textual contents. When implementing bidirectional local interactions, we observe that the local contents of images tend to be cluttered or unrelated to their textual descriptions; thus, we employ an attentive masking strategy to select effective image patches for interaction. Extensive experiments on five vision-language tasks, including image-text retrieval, image captioning, visual question answering, and zero-shot and fine-tuned image classification, validate the effectiveness of our proposed method.
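To make the described objective concrete, below is a minimal PyTorch-style sketch of the three-head training loss the abstract outlines: the two CoCa heads (ITC and IC) plus the proposed TG-MIM head, with an attentive mask that selects text-relevant patches. Everything here is an assumption for illustration: the module and argument names (`tgmim_head`, `keep_ratio`, `temperature`), the cosine-similarity scoring, the equal loss weights, the MSE reconstruction target, and the choice to reconstruct the patches outside the selected set are all one plausible reading of the abstract, not the paper's actual implementation.

```python
import torch
import torch.nn.functional as F

def attentive_patch_mask(patch_feats, text_feat, keep_ratio=0.5):
    """Score each patch against the global text feature and keep only the
    most text-relevant patches for local interaction (True = kept).
    patch_feats: (B, N, D); text_feat: (B, D). Ratio is a guess."""
    scores = F.cosine_similarity(patch_feats, text_feat.unsqueeze(1), dim=-1)
    num_keep = max(1, int(patch_feats.size(1) * keep_ratio))
    keep = torch.zeros_like(scores, dtype=torch.bool)
    keep.scatter_(1, scores.topk(num_keep, dim=1).indices, True)
    return keep

def sycoca_objective(img_global, txt_global, patch_feats, patch_targets,
                     caption_logits, caption_labels, tgmim_head,
                     temperature=0.07):
    """L = L_ITC + L_IC + L_TG-MIM (equal weighting is an assumption)."""
    # ITC: bidirectional contrastive loss on global image/text features.
    img = F.normalize(img_global, dim=-1)
    txt = F.normalize(txt_global, dim=-1)
    logits = img @ txt.t() / temperature
    targets = torch.arange(img.size(0), device=img.device)
    l_itc = 0.5 * (F.cross_entropy(logits, targets)
                   + F.cross_entropy(logits.t(), targets))
    # IC: unidirectional image-to-text generation (token cross-entropy).
    l_ic = F.cross_entropy(caption_logits.flatten(0, 1),
                           caption_labels.flatten())
    # TG-MIM: zero out patches outside the attentively selected set and
    # reconstruct them from textual cues plus the visible context
    # (a plausible reading of the abstract, not the paper's exact recipe).
    keep = attentive_patch_mask(patch_feats, txt_global)
    pred = tgmim_head(patch_feats * keep.unsqueeze(-1), txt_global)  # (B, N, D)
    l_tgmim = F.mse_loss(pred[~keep], patch_targets[~keep])
    return l_itc + l_ic + l_tgmim
```

Which patches get masked versus kept, and how the three losses are weighted, are design details the abstract leaves open; the sketch only illustrates how a text-guided reconstruction head symmetrizes CoCa's otherwise unidirectional local interaction.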