An Empirical Study of LLM-as-a-Judge for LLM Evaluation: Fine-tuned Judge Models are Task-specific Classifiers
arXiv (2024)
Abstract
Recently, there has been a growing trend of using Large Language Models
(LLMs) to evaluate the quality of other LLMs. Many studies have employed
proprietary closed-source models, especially GPT-4, as the evaluator.
Alternatively, other works have fine-tuned judge models based on open-source
LLMs as the evaluator. In this study, we conduct an empirical study of
different judge models and their evaluation capability. Our findings indicate
that although the fine-tuned judge models achieve high accuracy on in-domain
test sets, even surpassing GPT-4, they are inherently task-specific classifiers,
and their generalizability and fairness severely underperform GPT-4.