Can Large Language Models be Trusted for Evaluation? Scalable Meta-Evaluation of LLMs as Evaluators via Agent Debate
CoRR (2024)
Abstract
Despite the utility of Large Language Models (LLMs) across a wide range of
tasks and scenarios, developing a method for reliably evaluating LLMs across
varied contexts continues to be challenging. Modern evaluation approaches often
use LLMs to assess responses generated by LLMs. However, the meta-evaluation
conducted to assess the effectiveness of these LLMs as evaluators is typically
constrained by the coverage of existing benchmarks or requires extensive human
annotation. This underscores the urgency of methods for scalable
meta-evaluation that can effectively, reliably, and efficiently evaluate the
performance of LLMs as evaluators across diverse tasks and scenarios,
particularly in potentially new, user-defined scenarios. To fill this gap, we
propose ScaleEval, an agent-debate-assisted meta-evaluation framework that
leverages the capabilities of multiple communicative LLM agents. This framework
supports multi-round discussions to assist human annotators in discerning the
most capable LLMs as evaluators, which significantly eases their workload in
cases that used to require large-scale annotations during meta-evaluation. We
release the code for our framework publicly.