MMMU: A Massive Multi-discipline Multimodal Understanding and Reasoning Benchmark for Expert AGI
CoRR(2023)
Abstract
We introduce MMMU: a new benchmark designed to evaluate multimodal models on
massive multi-discipline tasks demanding college-level subject knowledge and
deliberate reasoning. MMMU includes 11.5K meticulously collected multimodal
questions from college exams, quizzes, and textbooks, covering six core
disciplines: Art & Design, Business, Science, Health & Medicine, Humanities &
Social Science, and Tech & Engineering. These questions span 30 subjects and
183 subfields, comprising 30 highly heterogeneous image types, such as charts,
diagrams, maps, tables, music sheets, and chemical structures. Unlike existing
benchmarks, MMMU focuses on advanced perception and reasoning with
domain-specific knowledge, challenging models to perform tasks akin to those
faced by experts. The evaluation of 14 open-source LMMs as well as the
proprietary GPT-4V(ision) and Gemini highlights the substantial challenges
posed by MMMU. Even the advanced GPT-4V and Gemini Ultra only achieve
accuracies of 56% and 59% respectively, indicating significant room for
improvement. We believe MMMU will stimulate the community to build
next-generation multimodal foundation models towards expert artificial general
intelligence.