NPHardEval4V: A Dynamic Reasoning Benchmark of Multimodal Large Language Models
arxiv(2024)
摘要
Understanding the reasoning capabilities of Multimodal Large Language Models
(MLLMs) is an important area of research. In this study, we introduce a dynamic
benchmark, NPHardEval4V, aimed at addressing the existing gaps in evaluating
the pure reasoning abilities of MLLMs. Our benchmark aims to provide a venue to
disentangle the effect of various factors such as image recognition and
instruction following, from the overall performance of the models, allowing us
to focus solely on evaluating their reasoning abilities. It is built by
converting textual description of questions from NPHardEval to image
representations. Our findings reveal significant discrepancies in reasoning
abilities across different models and highlight the relatively weak performance
of MLLMs compared to LLMs in terms of reasoning. We also investigate the impact
of different prompting styles, including visual, text, and combined visual and
text prompts, on the reasoning abilities of MLLMs, demonstrating the different
impacts of multimodal inputs in model performance. Unlike traditional
benchmarks, which focus primarily on static evaluations, our benchmark will be
updated monthly to prevent overfitting and ensure a more authentic and
fine-grained evaluation of the models. We believe that this benchmark can aid
in understanding and guide the further development of reasoning abilities in
MLLMs. The benchmark dataset and code are available at
https://github.com/lizhouf/NPHardEval4V
更多查看译文
AI 理解论文
溯源树
样例
生成溯源树,研究论文发展脉络
Chat Paper
正在生成论文摘要