MoVA: Adapting Mixture of Vision Experts to Multimodal Context
arxiv(2024)
摘要
As the key component in multimodal large language models (MLLMs), the ability
of the visual encoder greatly affects MLLM's understanding on diverse image
content. Although some large-scale pretrained vision encoders such as vision
encoders in CLIP and DINOv2 have brought promising performance, we found that
there is still no single vision encoder that can dominate various image content
understanding, e.g., the CLIP vision encoder leads to outstanding results on
general image understanding but poor performance on document or chart content.
To alleviate the bias of CLIP vision encoder, we first delve into the inherent
behavior of different pre-trained vision encoders and then propose the MoVA, a
powerful and novel MLLM, adaptively routing and fusing task-specific vision
experts with a coarse-to-fine mechanism. In the coarse-grained stage, we design
a context-aware expert routing strategy to dynamically select the most suitable
vision experts according to the user instruction, input image, and expertise of
vision experts. This benefits from the powerful model function understanding
ability of the large language model (LLM) equipped with expert-routing low-rank
adaptation (LoRA). In the fine-grained stage, we elaborately conduct the
mixture-of-vision-expert adapter (MoV-Adapter) to extract and fuse
task-specific knowledge from various experts. This coarse-to-fine paradigm
effectively leverages representations from experts based on multimodal context
and model expertise, further enhancing the generalization ability. We conduct
extensive experiments to evaluate the effectiveness of the proposed approach.
Without any bells and whistles, MoVA can achieve significant performance gains
over current state-of-the-art methods in a wide range of challenging multimodal
benchmarks. Codes and models will be available at
https://github.com/TempleX98/MoVA.
更多查看译文
AI 理解论文
溯源树
样例
生成溯源树,研究论文发展脉络
Chat Paper
正在生成论文摘要