A Surprising Failure? Multimodal LLMs and the NLVR Challenge
CoRR(2024)
Abstract
This study evaluates three state-of-the-art MLLMs – GPT-4V, Gemini Pro, and
the open-source model IDEFICS – on the compositional natural language vision
reasoning task NLVR. Given a human-written sentence paired with a synthetic
image, this task requires the model to determine the truth value of the
sentence with respect to the image. Despite the strong performance these models
have demonstrated, we observe that they perform poorly on NLVR, which was
constructed to require compositional and spatial reasoning and to be robust to
semantic and systematic biases.