Proximity QA: Unleashing the Power of Multi-Modal Large Language Models for Spatial Proximity Analysis
CoRR (2024)
Abstract
Multi-modal large language models (MLLMs) have demonstrated remarkable
vision-language capabilities, primarily due to the exceptional in-context
understanding and multi-task learning strengths of large language models
(LLMs). The advent of visual instruction tuning has further enhanced MLLMs'
performance in vision-language understanding. However, while existing MLLMs
adeptly recognize what objects are in an image, they still face
challenges in effectively discerning where these objects are,
particularly along the distance (scene depth) axis. To overcome this limitation
in MLLMs, we introduce Proximity Question Answering (Proximity QA), a novel
framework designed to enable MLLMs to infer the proximity relationship between
objects in images. The framework operates in two phases: the first phase
focuses on guiding the models to understand the relative depth of objects, and
the second phase further encourages the models to infer the proximity
relationships between objects based on their depth perceptions. We also propose
a VQA dataset called Proximity-110K, containing additional instructions that
incorporate depth information and the proximity relationships of objects. We
have conducted extensive experiments to validate Proximity QA's superior
ability in depth perception and proximity analysis, outperforming other
state-of-the-art MLLMs. Code and dataset will be released at
https://github.com/NorthSummer/ProximityQA.git.
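To make the two-phase idea concrete, here is a minimal illustrative sketch, not the paper's actual method: assuming phase one yields per-object relative depth estimates (smaller meaning nearer), phase two's proximity question can be answered by comparing them. The function name, the objects, and the depth values are all hypothetical.

```python
# Illustrative sketch only: names and data are hypothetical, not from
# the Proximity QA paper. It mimics the second phase's goal of answering
# a proximity question from per-object depth estimates (phase one).

def proximity_answer(depths: dict, obj_a: str, obj_b: str) -> str:
    """Answer 'which object is closer to the camera?'.

    depths maps object names to relative depth values, where a smaller
    value means the object is nearer to the camera.
    """
    if depths[obj_a] < depths[obj_b]:
        return f"The {obj_a} is closer to the camera than the {obj_b}."
    if depths[obj_a] > depths[obj_b]:
        return f"The {obj_b} is closer to the camera than the {obj_a}."
    return f"The {obj_a} and the {obj_b} are at roughly the same depth."

# Hypothetical relative depth estimates for two objects in one image:
depths = {"dog": 0.3, "car": 0.8}
print(proximity_answer(depths, "dog", "car"))
# → The dog is closer to the camera than the car.
```

In the actual framework this comparison is performed by the MLLM itself after instruction tuning on Proximity-110K, rather than by an explicit rule as sketched here.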