PATCH – Psychometrics-AssisTed benCHmarking of Large Language Models: A Case Study of Mathematics Proficiency
arXiv (2024)
Abstract
Many existing benchmarks of large (multimodal) language models (LLMs) focus
on measuring LLMs' academic proficiency, often with an additional interest in
comparing model performance with that of human test takers. While these benchmarks have
proven key to the development of LLMs, they suffer from several limitations,
including questionable measurement quality (e.g., Do they measure what they are
supposed to in a reliable way?), lack of quality assessment at the item level
(e.g., Are some items more important or difficult than others?), and unclear
human population reference (e.g., To whom can the model be compared?). In
response to these challenges, we propose incorporating knowledge from
psychometrics - a field dedicated to the measurement of latent variables such as
academic proficiency - into LLM benchmarking. We make three primary
contributions. First, we introduce PATCH: a novel framework for
Psychometrics-AssisTed benCHmarking of LLMs. PATCH addresses the aforementioned
limitations, presenting a new direction for LLM benchmark research. Second, we
implement PATCH by measuring the proficiency of GPT-4 and Gemini-Pro-Vision in
8th-grade mathematics against 56 human populations. We show that adopting a
psychometrics-based approach yields evaluation outcomes that diverge from those
based on existing benchmarking practices. Third, we release 4 datasets to
support measuring and comparing LLM proficiency in grade school mathematics and
science against human populations.
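
As an illustration of the psychometric machinery the abstract alludes to (the paper's exact model may differ), a standard item response theory formulation such as the two-parameter logistic (2PL) model expresses the probability that test taker j answers item i correctly in terms of the taker's latent proficiency \theta_j, the item's difficulty b_i, and its discrimination a_i:

P(X_{ij} = 1 \mid \theta_j) = \frac{1}{1 + \exp\left(-a_i(\theta_j - b_i)\right)}

Fitting such a model to human response data yields item-level difficulty and discrimination estimates, and places an LLM's estimated proficiency \theta on the same scale as the human reference populations.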