ROME: Evaluating Pre-trained Vision-Language Models on Reasoning beyond Visual Common Sense.
CoRR (2023)
Abstract
Humans possess a strong capability for reasoning beyond common sense. For
example, given an unconventional image of a goldfish lying on the table next
to an empty fishbowl, a human would effortlessly determine that the fish is not
inside the fishbowl. The case, however, may be different for a vision-language
model, whose reasoning could gravitate towards the common scenario that the
fish is inside the bowl, despite the visual input. In this paper, we introduce
a novel probing dataset named ROME (reasoning beyond commonsense knowledge) to
evaluate whether the state-of-the-art pre-trained vision-language models have
the reasoning capability to correctly interpret counter-intuitive content. ROME
contains images that defy commonsense knowledge with regard to color, shape,
material, size and positional relation. Experiments on the state-of-the-art
pre-trained vision-language models reveal that most of these models are still
largely incapable of interpreting counter-intuitive scenarios. We hope that
ROME will spur further investigations on reasoning beyond commonsense knowledge
in vision-language research.