IllusionVQA: A Challenging Optical Illusion Dataset for Vision Language Models
arXiv (2024)
Abstract
The advent of Vision Language Models (VLMs) has allowed researchers to
investigate the visual understanding of a neural network using natural
language. Beyond object classification and detection, VLMs are capable of
visual comprehension and common-sense reasoning. This naturally led to the
question: How do VLMs respond when the image itself is inherently unreasonable?
To this end, we present IllusionVQA: a diverse dataset of challenging optical
illusions and hard-to-interpret scenes to test the capability of VLMs in two
distinct multiple-choice VQA tasks - comprehension and soft localization.
GPT4V, the best-performing VLM, achieves 62.99% accuracy on the
comprehension task and 49.7% on the localization task ([...]
Chain-of-Thought). Human evaluation reveals that humans achieve 91.03% [...]
accuracy in comprehension and localization. We discover that In-Context
Learning (ICL) and Chain-of-Thought reasoning substantially degrade the
performance of GeminiPro on the localization task. Tangentially, we discover a
potential weakness in the ICL capabilities of VLMs: they fail to locate optical
illusions even when the correct answer is in the context window as a few-shot
example.
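
The multiple-choice, few-shot setup described above can be illustrated with a minimal sketch. It is not the authors' evaluation code: the `MCQuestion` schema, the prompt layout, and the function names are all hypothetical, and the images that a real VLM prompt would carry are omitted, so only the text side of the prompt and the accuracy computation are shown.

```python
from dataclasses import dataclass
from typing import List, Sequence


@dataclass
class MCQuestion:
    """One multiple-choice item (hypothetical schema, not the dataset's actual format)."""
    question: str
    options: List[str]
    answer_index: int  # index of the correct option in `options`


def format_question(item: MCQuestion, with_answer: bool = False) -> str:
    """Render a question with lettered options; optionally append the gold answer."""
    letters = "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
    lines = [item.question]
    lines += [f"{letters[i]}. {opt}" for i, opt in enumerate(item.options)]
    lines.append(f"Answer: {letters[item.answer_index]}" if with_answer else "Answer:")
    return "\n".join(lines)


def build_few_shot_prompt(shots: Sequence[MCQuestion], target: MCQuestion) -> str:
    """Concatenate solved examples, then the unsolved target question."""
    blocks = [format_question(s, with_answer=True) for s in shots]
    blocks.append(format_question(target))
    return "\n\n".join(blocks)


def accuracy(predicted: Sequence[int], items: Sequence[MCQuestion]) -> float:
    """Percentage of items whose predicted option index matches the gold index."""
    if len(predicted) != len(items):
        raise ValueError("predicted and items must have the same length")
    if not items:
        return 0.0
    correct = sum(p == it.answer_index for p, it in zip(predicted, items))
    return 100.0 * correct / len(items)
```

In this sketch a k-shot prompt is simply k solved items followed by the target item, and accuracy is reported as a percentage so that it is directly comparable to the figures quoted in the abstract.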