Few-Shot VQA with Frozen LLMs: A Tale of Two Approaches
arXiv (2024)
Abstract
Two approaches have emerged to input images into large language models
(LLMs). The first is to caption images into natural language. The second is to
map image feature embeddings into the domain of the LLM and pass the mapped
embeddings directly to the LLM. Most recent few-shot multimodal work
reports performance using architectures that employ variations of one of these
two approaches, but overlooks a direct comparison between them. We
design a controlled and focused experiment to compare these two approaches to
few-shot visual question answering (VQA) with LLMs. Our findings indicate that
for Flan-T5 XL, a 3B parameter LLM, connecting visual embeddings directly to
the LLM embedding space does not guarantee improved performance over using
image captions. In the zero-shot regime, we find using textual image captions
is better. In the few-shot regimes, how the in-context examples are selected
determines which is better.
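The second approach described above can be sketched as follows. This is a minimal illustration, not the architecture from the paper: the feature dimensions, the use of a single linear projection, and all array contents are hypothetical (real systems typically learn a more complex mapping network), but it shows the core idea of projecting visual features into the LLM's embedding space and prepending them to the text prompt's token embeddings.

```python
import numpy as np

# Hypothetical sizes: 32 visual patch features of dim 768,
# an LLM embedding space of dim 2048, a 10-token text prompt.
rng = np.random.default_rng(0)
visual_feats = rng.standard_normal((32, 768))

# Learned linear projection into the LLM embedding space
# (a simplification; stands in for a trained mapping network).
W = rng.standard_normal((768, 2048)) * 0.02
visual_embeds = visual_feats @ W  # shape (32, 2048)

# Token embeddings for a prompt such as "Question: ... Answer:".
prompt_embeds = rng.standard_normal((10, 2048))

# The frozen LLM consumes the concatenated sequence directly;
# no natural-language caption is ever produced.
llm_input = np.concatenate([visual_embeds, prompt_embeds], axis=0)
print(llm_input.shape)  # (42, 2048)
```

By contrast, the first approach would replace `visual_embeds` with the token embeddings of a generated caption, so the LLM only ever sees text.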