OlympiadBench: A Challenging Benchmark for Promoting AGI with Olympiad-Level Bilingual Multimodal Scientific Problems
CoRR(2024)
摘要
Recent advancements have seen Large Language Models (LLMs) and Large
Multimodal Models (LMMs) surpassing general human capabilities in various
tasks, approaching the proficiency level of human experts across multiple
domains. With traditional benchmarks becoming less challenging for these
models, new rigorous challenges are essential to gauge their advanced
abilities. In this work, we present OlympiadBench, an Olympiad-level bilingual
multimodal scientific benchmark, featuring 8,952 problems from Olympiad-level
mathematics and physics competitions, including the Chinese college entrance
exam. Each problem is detailed with expert-level annotations for step-by-step
reasoning. Evaluating top-tier models on OlympiadBench, we implement a
comprehensive assessment methodology to accurately evaluate model responses.
Notably, the best-performing model, GPT-4V, attains an average score of 17.23
on OlympiadBench, with a mere 11.28
rigor and the intricacy of physical reasoning. Our analysis orienting GPT-4V
points out prevalent issues with hallucinations, knowledge omissions, and
logical fallacies. We hope that our challenging benchmark can serve as a
valuable resource for helping future AGI research endeavors.
更多查看译文
AI 理解论文
溯源树
样例
生成溯源树,研究论文发展脉络
Chat Paper
正在生成论文摘要