RAR: Retrieving And Ranking Augmented MLLMs for Visual Recognition
arXiv (2024)
Abstract
CLIP (Contrastive Language-Image Pre-training) uses contrastive learning from
noisy image-text pairs to excel at recognizing a wide array of candidates, yet
its focus on broad associations hinders the precision in distinguishing subtle
differences among fine-grained items. Conversely, Multimodal Large Language
Models (MLLMs) excel at classifying fine-grained categories, thanks to their
substantial knowledge from pre-training on web-level corpora. However, the
performance of MLLMs declines with an increase in category numbers, primarily
due to growing complexity and constraints of limited context window size. To
synergize the strengths of both approaches and enhance the few-shot/zero-shot
recognition abilities for datasets characterized by extensive and fine-grained
vocabularies, this paper introduces RAR, a Retrieving And Ranking augmented
method for MLLMs. We initially establish a multi-modal retriever based on CLIP
to create and store explicit memory for different categories beyond the
immediate context window. During inference, RAR retrieves the top-k similar
results from the memory and uses MLLMs to rank and make the final predictions.
Our proposed approach not only addresses the inherent limitations in
fine-grained recognition but also preserves the model's comprehensive knowledge
base, significantly boosting accuracy across a range of vision-language
recognition tasks. Notably, our approach demonstrates a significant improvement
in performance on 5 fine-grained visual recognition benchmarks, 11 few-shot
image recognition datasets, and 2 object detection datasets under the
zero-shot recognition setting.
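
The retrieve-then-rank flow described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the memory here holds toy 3-dimensional vectors rather than CLIP embeddings, and `toy_ranker` is a hypothetical stand-in for the MLLM ranking step (a real system would prompt an MLLM with the image and the retrieved category names). All function and variable names are assumptions made for illustration.

```python
import math
from typing import Callable, Dict, List, Sequence

Vector = Sequence[float]


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity between two vectors (0.0 if either has zero norm)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def retrieve_top_k(query: Vector, memory: Dict[str, Vector], k: int) -> List[str]:
    """Return the k category names whose stored embeddings best match the query."""
    ordered = sorted(memory, key=lambda name: cosine(query, memory[name]), reverse=True)
    return ordered[:k]


def retrieve_and_rank(
    query: Vector,
    memory: Dict[str, Vector],
    k: int,
    rank_fn: Callable[[List[str]], List[str]],
) -> str:
    """Retrieve k candidates, let rank_fn reorder them, and return the top one."""
    candidates = retrieve_top_k(query, memory, k)
    ranked = rank_fn(candidates)
    return ranked[0]


# Toy external memory: category name -> embedding (assumed values, not CLIP outputs).
memory = {
    "sparrow": [0.9, 0.1, 0.0],
    "finch": [0.8, 0.2, 0.0],
    "truck": [0.0, 0.1, 0.9],
}
query = [0.85, 0.15, 0.0]


def toy_ranker(candidates: List[str]) -> List[str]:
    # Placeholder for the MLLM: reorders candidates alphabetically so the
    # example shows that ranking can change the retriever's order.
    return sorted(candidates)


candidates = retrieve_top_k(query, memory, k=2)
prediction = retrieve_and_rank(query, memory, k=2, rank_fn=toy_ranker)
print(candidates, prediction)
```

The point of the split is that the ranker only ever sees `k` candidates, so the size of the full category vocabulary in `memory` does not affect how much has to fit in the ranker's context.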