UniBind: LLM-Augmented Unified and Balanced Representation Space to Bind Them All
CVPR 2024
Abstract
We present UniBind, a flexible and efficient approach that learns a unified
representation space for seven diverse modalities: images, text, audio, point
cloud, thermal, video, and event data. Existing works, e.g., ImageBind, treat
the image as the central modality and build an image-centered representation
space; however, the space may be sub-optimal as it leads to an unbalanced
representation space among all modalities. Moreover, the category names are
directly used to extract text embeddings for downstream tasks, which can hardly
capture the semantics of multi-modal data. The
'out-of-the-box' insight of our UniBind is to make the alignment center
modality-agnostic and further learn a unified and balanced representation
space, empowered by large language models (LLMs). UniBind is superior in
its flexible application to all CLIP-style models and delivers remarkable
performance boosts. To make this possible, we 1) construct a knowledge base of
text embeddings with the help of LLMs and multi-modal LLMs; 2) adaptively build
an LLM-augmented class-wise embedding center on top of the knowledge base and
encoded visual embeddings; 3) align all the embeddings to the LLM-augmented
embedding center via contrastive learning to achieve a unified and balanced
representation space. UniBind shows strong zero-shot recognition performance
gains over prior arts by an average of 6.36%. Finally, we achieve new
state-of-the-art performance, e.g., a 6.75% gain on ImageNet, on the multi-modal
fine-tuning setting while reducing 90% of the learnable parameters.
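To make steps 2) and 3) concrete, here is a minimal PyTorch sketch: it pools LLM-generated text embeddings into class-wise centers and aligns modality embeddings to them with an InfoNCE-style contrastive loss. All names, tensor shapes, and the simple mean-pooling of the knowledge base are illustrative assumptions; the paper builds the centers adaptively from both text and visual embeddings, and real frozen encoders would replace the random placeholders below.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)

# Hypothetical sizes, standing in for real encoder / LLM outputs.
num_classes, descs_per_class, dim = 10, 5, 512
batch = 32

# 1) Knowledge base: LLM-generated descriptions per class, encoded into text
#    embeddings (random placeholders here for a frozen text encoder).
kb_text_emb = F.normalize(
    torch.randn(num_classes, descs_per_class, dim), dim=-1
)

# 2) Class-wise embedding centers: pool the knowledge-base embeddings per
#    class. A plain mean is used here; the paper builds the centers
#    adaptively on top of the knowledge base and visual embeddings.
centers = F.normalize(kb_text_emb.mean(dim=1), dim=-1)  # (num_classes, dim)

# 3) Contrastive alignment: pull each modality embedding toward its class
#    center with an InfoNCE-style loss at temperature tau.
def center_contrastive_loss(emb, labels, centers, tau=0.07):
    emb = F.normalize(emb, dim=-1)
    logits = emb @ centers.t() / tau  # (batch, num_classes) similarities
    return F.cross_entropy(logits, labels)

# Placeholder features from one modality encoder (e.g., audio or event).
modality_emb = torch.randn(batch, dim, requires_grad=True)
labels = torch.randint(0, num_classes, (batch,))

loss = center_contrastive_loss(modality_emb, labels, centers)
loss.backward()  # gradients would update the modality encoder
print(f"alignment loss: {loss.item():.4f}")
```

With the centers fixed, zero-shot recognition reduces to assigning each embedding to its nearest class center, i.e., the argmax over the same `logits`.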