TIGERScore: Towards Building Explainable Metric for All Text Generation Tasks
arxiv(2023)
摘要
We present TIGERScore, a Trained metric that follows
Instruction Guidance to perform Explainable, and
Reference-free evaluation over a wide spectrum of text generation
tasks. Different from other automatic evaluation methods that only provide
arcane scores, TIGERScore is guided by natural language instruction to provide
error analysis to pinpoint the mistakes in the generated text. Our metric is
based on LLaMA-2, trained on our meticulously curated instruction-tuning
dataset MetricInstruct which covers 6 text generation tasks and 23 text
generation datasets. The dataset consists of 42K quadruple in the form of
(instruction, input, system output → error analysis). We collected
the `system outputs' through from a large variety of models to cover different
types of errors. To quantitatively assess our metric, we evaluate its
correlation with human ratings on 5 held-in datasets, 2 held-out datasets and
show that TIGERScore can achieve the open-source SoTA correlation with human
ratings across these datasets and almost approaches GPT-4 evaluator. As a
reference-free metric, its correlation can even surpass the best existing
reference-based metrics. To further qualitatively assess the rationale
generated by our metric, we conduct human evaluation on the generated
explanations and found that the explanations are 70.8% accurate. Through these
experimental results, we believe TIGERScore demonstrates the possibility of
building universal explainable metrics to evaluate any text generation task.
All the resourced are released in our project website:
.
更多查看译文
AI 理解论文
溯源树
样例
生成溯源树,研究论文发展脉络
Chat Paper
正在生成论文摘要