Bridging Vision and Language Spaces with Assignment Prediction
arXiv (2024)
Abstract
This paper introduces VLAP, a novel approach that bridges pretrained vision
models and large language models (LLMs) to make frozen LLMs understand the
visual world. VLAP transforms the embedding space of pretrained vision models
into the LLMs' word embedding space using a single linear layer for efficient
and general-purpose visual and language understanding. Specifically, we harness
well-established word embeddings to bridge two modality embedding spaces. The
visual and text representations are simultaneously assigned to a set of word
embeddings within pretrained LLMs by formulating the assigning procedure as an
optimal transport problem. We predict the assignment of one modality from the
representation of the other modality, enforcing consistent assignments for
paired multimodal data. This allows vision and language representations to
contain the same information, grounding the frozen LLMs' word embedding space
in visual data. Moreover, the robust semantic taxonomy of the LLMs is preserved
under visual grounding, since LLMs interpret and reason about linguistic
information through correlations between word embeddings. Experimental results show that VLAP
achieves substantial improvements over the previous linear transformation-based
approaches across a range of vision-language tasks, including image captioning,
visual question answering, and cross-modal retrieval. We also demonstrate that
the learned visual representations carry the semantic taxonomy of the LLMs,
making visual semantic arithmetic possible.
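
To make the assignment-prediction idea concrete, below is a minimal PyTorch-style sketch of how the pieces described in the abstract could fit together: a single linear layer maps visual features into the LLM word embedding space, a Sinkhorn-Knopp iteration produces optimal-transport assignments of both modalities to the frozen word embeddings, and each modality is trained to predict the other's assignment. All names (`AssignmentPredictionLoss`, `sinkhorn`), the temperature, and the iteration counts are illustrative assumptions, not the paper's actual implementation.

```python
import torch
import torch.nn.functional as F


def sinkhorn(scores: torch.Tensor, n_iters: int = 3, eps: float = 0.05) -> torch.Tensor:
    """Sinkhorn-Knopp iteration: turns a (batch x vocab) score matrix into soft
    assignments with roughly uniform marginals, i.e. an entropy-regularized
    optimal transport plan between samples and word embeddings."""
    Q = torch.exp(scores / eps).t()  # (vocab x batch)
    Q /= Q.sum()
    K, B = Q.shape
    for _ in range(n_iters):
        Q /= Q.sum(dim=1, keepdim=True)  # normalize over samples per word
        Q /= K
        Q /= Q.sum(dim=0, keepdim=True)  # normalize over words per sample
        Q /= B
    return (Q * B).t()  # (batch x vocab), rows sum to 1


class AssignmentPredictionLoss(torch.nn.Module):
    """Hypothetical VLAP-style head: one linear layer maps vision features into
    the LLM embedding space; both modalities are softly assigned to the frozen
    word embeddings, and each modality predicts the other's assignment."""

    def __init__(self, vis_dim: int, llm_dim: int, word_emb: torch.Tensor, temp: float = 0.1):
        super().__init__()
        self.proj = torch.nn.Linear(vis_dim, llm_dim)  # the single trainable linear layer
        self.register_buffer("word_emb", F.normalize(word_emb, dim=-1))  # frozen LLM embeddings
        self.temp = temp

    def forward(self, vis_feat: torch.Tensor, txt_feat: torch.Tensor) -> torch.Tensor:
        z_v = F.normalize(self.proj(vis_feat), dim=-1)  # mapped visual features
        z_t = F.normalize(txt_feat, dim=-1)             # LLM text features
        s_v = z_v @ self.word_emb.t()  # similarities to every word embedding
        s_t = z_t @ self.word_emb.t()
        with torch.no_grad():  # OT assignments serve as (stop-gradient) targets
            q_v = sinkhorn(s_v)
            q_t = sinkhorn(s_t)
        # Swapped prediction: the visual branch predicts the text assignment
        # and vice versa, enforcing consistent assignments for paired data.
        loss_v = -(q_t * F.log_softmax(s_v / self.temp, dim=-1)).sum(-1).mean()
        loss_t = -(q_v * F.log_softmax(s_t / self.temp, dim=-1)).sum(-1).mean()
        return loss_v + loss_t
```

In this reading, the Sinkhorn targets keep assignments spread across the vocabulary (avoiding collapse onto a few words), while the swapped cross-entropy is what forces paired image and text features to carry the same information relative to the frozen word embeddings.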