All in an Aggregated Image for In-Image Learning
arXiv (2024)
Abstract
This paper introduces a new in-context learning (ICL) mechanism called
In-Image Learning (I^2L) that combines demonstration examples, visual cues,
and chain-of-thought reasoning into an aggregated image to enhance the
capabilities of Large Multimodal Models (e.g., GPT-4V) in multimodal reasoning
tasks. Unlike previous approaches that rely on converting images to text or
incorporating visual input into language models, I^2L consolidates all
information into an aggregated image and leverages the image processing,
understanding, and reasoning abilities of large multimodal models. This
approach has several advantages: it reduces
inaccurate textual descriptions of complex images, provides flexibility in
positioning demonstration examples, and avoids multiple input images and
lengthy prompts. We also introduce I^2L-Hybrid, a method that combines the
strengths of I^2L with other ICL methods. Specifically, it uses an automatic
strategy to select the most suitable method (I^2L or an alternative ICL
method) for each task instance. We conduct extensive experiments to
assess the effectiveness of I^2L and I^2L-Hybrid on MathVista, which covers
a variety of complex multimodal reasoning tasks. Additionally, we investigate
the influence of image resolution, the number of demonstration examples in a
single image, and the positions of these demonstrations in the aggregated image
on the effectiveness of I^2L. Our code is publicly available at
https://github.com/AGI-Edgerunners/IIL.
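The aggregation step can be pictured as compositing each demonstration image with its chain-of-thought text, plus the test image and question, onto a single canvas that is then passed to the model. The sketch below illustrates this idea with Pillow; the vertical layout, fixed panel width, caption strip, and the `aggregate_demonstrations` helper are illustrative assumptions rather than the paper's actual implementation (see the linked repository for that).

```python
from PIL import Image, ImageDraw

def aggregate_demonstrations(demos, test_image_path, test_question,
                             panel_width=512, caption_height=160):
    """Compose demonstration (image, chain-of-thought) pairs and the test
    instance into one aggregated image, stacked vertically.

    `demos` is a list of (image_path, cot_text) tuples. The layout choices
    here (vertical stacking, common panel width) are assumptions made for
    illustration, not the paper's prescribed arrangement.
    """
    panels = []
    for image_path, text in demos + [(test_image_path, test_question)]:
        img = Image.open(image_path).convert("RGB")
        # Resize every panel to a common width, preserving aspect ratio.
        img = img.resize((panel_width, int(img.height * panel_width / img.width)))
        # Render the caption (demonstration reasoning, or the test question)
        # in a white strip below the image, so the model reads it visually.
        panel = Image.new("RGB", (panel_width, img.height + caption_height), "white")
        panel.paste(img, (0, 0))
        ImageDraw.Draw(panel).text((8, img.height + 8), text, fill="black")
        panels.append(panel)
    # Stack all panels into a single aggregated image.
    aggregated = Image.new("RGB", (panel_width, sum(p.height for p in panels)), "white")
    y = 0
    for panel in panels:
        aggregated.paste(panel, (0, y))
        y += panel.height
    return aggregated
```

With such a canvas, the model receives a single image plus a short textual instruction, which is the property the abstract highlights: no multiple input images and no lengthy prompts.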