DetToolChain: A New Prompting Paradigm to Unleash Detection Ability of MLLM
arXiv (2024)
Abstract
We present DetToolChain, a novel prompting paradigm, to unleash the zero-shot
object detection ability of multimodal large language models (MLLMs), such as
GPT-4V and Gemini. Our approach consists of a detection prompting toolkit
inspired by high-precision detection priors and a new Chain-of-Thought to
implement these prompts. Specifically, the prompts in the toolkit are designed
to guide the MLLM to focus on regional information (e.g., zooming in), read
coordinates according to measure standards (e.g., overlaying rulers and
compasses), and infer from the contextual information (e.g., overlaying scene
graphs). Building upon these tools, the new detection chain-of-thought can
automatically decompose the task into simple subtasks, diagnose the
predictions, and plan for progressive box refinements. The effectiveness of our
framework is demonstrated across a spectrum of detection tasks, especially hard
cases. Compared to existing state-of-the-art methods, GPT-4V with our
DetToolChain improves state-of-the-art object detectors by +21.5
COCO Novel class set for open-vocabulary detection, +24.23
set for zero-shot referring expression comprehension, +14.5
describe object detection FULL setting.