COSMO: COntrastive Streamlined MultimOdal Model with Interleaved Pre-Training
CoRR (2024)
Abstract
In the evolution of Vision-Language Pre-training, shifting from short-text
comprehension to encompassing extended textual contexts is pivotal. Recent
autoregressive vision-language models such as Flamingo, leveraging the
long-context capability of Large Language Models, have excelled in few-shot
text generation tasks but face challenges in alignment tasks. Addressing this
gap, we introduce the contrastive loss into text generation models, presenting
the COntrastive-Streamlined MultimOdal framework (CosMo), strategically
partitioning the language model into dedicated unimodal text processing and
adept multimodal data handling components. CosMo, our unified framework,
merges unimodal and multimodal elements, enhancing model performance for tasks
involving textual and visual data while notably reducing learnable parameters.
However, these models demand extensive long-text datasets, yet the availability
of high-quality long-text video datasets remains limited. To bridge this gap,
this work introduces Howto-Interlink7M, an inaugural interleaved video-text
dataset featuring comprehensive captions, marking a significant step forward.
Demonstrating its impact, we illustrate how Howto-Interlink7M enhances model
performance in image-text tasks. With 34% of the learnable parameters and
72% of the available data, our model demonstrates significant superiority over
OpenFlamingo. For instance, in the 4-shot Flickr captioning task, performance
notably improves from the 57.2% baseline. The efficacy of CosMo and
Howto-Interlink7M is underscored by notable performance gains across 14
diverse downstream datasets encompassing both image-text and video-text tasks.
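The abstract's central idea, adding a contrastive alignment loss to an autoregressive text generation model, can be sketched as a combined objective: a symmetric InfoNCE loss over paired image/text embeddings plus the usual next-token cross-entropy, summed with a weighting coefficient. The sketch below is an illustrative assumption, not the paper's implementation; the embedding shapes, temperature `tau`, and weight `lam` are hypothetical placeholders.

```python
import numpy as np

def info_nce(img, txt, tau=0.07):
    # Symmetric contrastive (InfoNCE) loss: matching image/text pairs sit on
    # the diagonal of the similarity matrix. tau is an assumed temperature.
    img = img / np.linalg.norm(img, axis=1, keepdims=True)
    txt = txt / np.linalg.norm(txt, axis=1, keepdims=True)
    logits = img @ txt.T / tau           # (B, B) cosine similarities
    idx = np.arange(len(img))

    def xent(l):
        l = l - l.max(axis=1, keepdims=True)       # numerical stability
        p = np.exp(l) / np.exp(l).sum(axis=1, keepdims=True)
        return -np.log(p[idx, idx]).mean()         # diagonal = true pairs

    # Average the image->text and text->image directions.
    return 0.5 * (xent(logits) + xent(logits.T))

def lm_loss(logits, targets):
    # Standard next-token cross-entropy for the generation head.
    l = logits - logits.max(axis=1, keepdims=True)
    p = np.exp(l) / np.exp(l).sum(axis=1, keepdims=True)
    return -np.log(p[np.arange(len(targets)), targets]).mean()

rng = np.random.default_rng(0)
B, D, V = 4, 8, 16                       # batch, embed dim, vocab (toy sizes)
img_emb = rng.normal(size=(B, D))
txt_emb = img_emb + 0.1 * rng.normal(size=(B, D))  # paired, slightly noisy
lam = 1.0                                # assumed loss-mixing weight
total = lm_loss(rng.normal(size=(B, V)), rng.integers(0, V, size=B)) \
        + lam * info_nce(img_emb, txt_emb)
print(float(total))
```

As a sanity check on the contrastive term, aligned pairs should score a lower loss than a shuffled pairing, which is exactly the alignment signal the generation-only models are said to lack.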