Learning to Localize Objects Improves Spatial Reasoning in Visual-LLMs
CVPR 2024
Abstract
Integration of Large Language Models (LLMs) into visual domain tasks,
resulting in visual-LLMs (V-LLMs), has enabled exceptional performance in
vision-language tasks, particularly for visual question answering (VQA).
However, existing V-LLMs (e.g., BLIP-2, LLaVA) demonstrate weak spatial
reasoning and localization awareness. Despite generating highly descriptive and
elaborate textual answers, these models fail at simple tasks like
distinguishing left from right locations. In this work, we explore how
image-space, coordinate-based instruction fine-tuning objectives can inject
spatial awareness into V-LLMs. We discover optimal coordinate representations,
data-efficient instruction fine-tuning objectives, and pseudo-data generation
strategies that lead to improved spatial awareness in V-LLMs. Additionally, our
resulting model improves VQA across image and video domains, reduces undesired
hallucination, and generates better contextual object descriptions. Experiments
across 5 vision-language tasks involving 14 different datasets establish the
clear performance improvements achieved by our proposed framework.
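
To make the idea of coordinate-based instruction fine-tuning concrete, the following is a minimal sketch of how image-space bounding boxes might be turned into textual instruction/response pairs. It is an illustration only and not the authors' implementation: the normalized floating-point box format, the prompt wording, and the helper names (`normalize_box`, `box_to_text`, `location_example`, `left_right_example`) are all assumptions made for this example.

```python
from typing import Dict, Tuple

# Assumed box layout: (x_min, y_min, x_max, y_max).
Box = Tuple[float, float, float, float]


def normalize_box(box: Box, width: int, height: int, ndigits: int = 3) -> Box:
    """Scale a pixel-space box to [0, 1] image-relative coordinates."""
    x0, y0, x1, y1 = box
    return (
        round(x0 / width, ndigits),
        round(y0 / height, ndigits),
        round(x1 / width, ndigits),
        round(y1 / height, ndigits),
    )


def box_to_text(box: Box) -> str:
    """Render a box as plain text so a language model can read or emit it."""
    return "[" + ", ".join(f"{v:.3f}" for v in box) + "]"


def location_example(name: str, box: Box, width: int, height: int) -> Dict[str, str]:
    """Build a hypothetical 'where is the object' instruction/response pair."""
    norm = normalize_box(box, width, height)
    return {
        "instruction": f"Where is the {name} located in the image?",
        "response": f"The {name} is at {box_to_text(norm)}.",
    }


def left_right_example(name_a: str, box_a: Box, name_b: str, box_b: Box) -> Dict[str, str]:
    """Build a hypothetical left/right question from two box centers."""
    center_a = (box_a[0] + box_a[2]) / 2
    center_b = (box_b[0] + box_b[2]) / 2
    side = "left" if center_a < center_b else "right"
    return {
        "instruction": f"Is the {name_a} to the left or right of the {name_b}?",
        "response": f"The {name_a} is to the {side} of the {name_b}.",
    }
```

Pairs like these could be mixed into an instruction-tuning set alongside ordinary VQA data. Which coordinate representation and which objectives work best is what the paper investigates, so the choices here should be read as placeholders.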