When Benchmarks are Targets: Revealing the Sensitivity of Large Language Model Leaderboards
CoRR (2024)
Abstract
Large Language Model (LLM) leaderboards based on benchmark rankings are
regularly used to guide practitioners in model selection. Often, the published
leaderboard rankings are taken at face value; we show this is a (potentially
costly) mistake. Under existing leaderboards, the relative performance of LLMs
is highly sensitive to (often minute) details. We show that for popular
multiple-choice question benchmarks (e.g., MMLU), minor perturbations to the
benchmark, such as changing the order of choices or the method of answer
selection, result in changes in rankings of up to 8 positions. We explain this
phenomenon by conducting systematic experiments over three broad categories of
benchmark perturbations and identifying the sources of this behavior. Our
analysis results in several best-practice recommendations, including the
advantage of a hybrid scoring method for answer selection. Our study highlights
the dangers of relying on simple benchmark evaluations and charts the path for
more robust evaluation schemes on the existing benchmarks.
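
As a rough illustration of the kind of perturbation described above, the following minimal Python sketch shuffles the answer choices of a multiple-choice item and measures how far any model moves between two leaderboards. It is not the authors' code: the function names (`permute_choices`, `ranking`, `max_rank_shift`), the model names, and the accuracy values are all hypothetical placeholders chosen for illustration, and they do not reproduce any result from the paper.

```python
import random

def permute_choices(choices, answer_idx, rng):
    """Shuffle the answer choices; return (new_choices, new_answer_idx)."""
    order = list(range(len(choices)))
    rng.shuffle(order)
    new_choices = [choices[i] for i in order]
    # The correct answer moves to wherever its old index landed.
    return new_choices, order.index(answer_idx)

def ranking(scores):
    """Map model name -> rank (1 = best); ties are broken by name."""
    ordered = sorted(scores, key=lambda m: (-scores[m], m))
    return {m: r for r, m in enumerate(ordered, start=1)}

def max_rank_shift(scores_a, scores_b):
    """Largest change in rank position for any model across two leaderboards."""
    ra, rb = ranking(scores_a), ranking(scores_b)
    return max(abs(ra[m] - rb[m]) for m in ra)

# Hypothetical accuracies (illustrative only, not taken from the paper).
original = {"model_a": 0.70, "model_b": 0.68, "model_c": 0.65}
perturbed = {"model_a": 0.66, "model_b": 0.69, "model_c": 0.67}

rng = random.Random(0)
choices, answer = permute_choices(["Paris", "Rome", "Oslo", "Bern"], 0, rng)
shift = max_rank_shift(original, perturbed)
```

Here `shift` is the maximum number of leaderboard positions any single model gains or loses when scores are recomputed on the perturbed benchmark, which is the quantity the abstract reports as reaching up to 8.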