An Assessment of the Accuracy of Automatic Evaluation in Summarization.

NAACL HLT '12: Proceedings of the Workshop on Evaluation Metrics and System Comparison for Automatic Summarization (2012)

Abstract
Automatic evaluation has greatly facilitated system development in summarization. At the same time, the use of automatic evaluation has been viewed with mistrust by many, as its accuracy and correct application are not well understood. In this paper we provide an assessment of the automatic evaluations used for multi-document summarization of news. We outline our recommendations about how any evaluation, manual or automatic, should be used to find statistically significant differences between summarization systems. We identify the reference automatic evaluation metrics---ROUGE 1 and 2---that appear to best emulate human pyramid and responsiveness scores on four years of NIST evaluations. We then demonstrate the accuracy of these metrics in reproducing human judgements about the relative content quality of pairs of systems, and present an empirical assessment of the relationship between statistically significant differences between systems according to manual evaluations and the differences according to automatic evaluations. Finally, we present a case study of how new metrics should be compared to the reference evaluation, as we search for even more accurate automatic measures.
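The abstract's core recipe (score each system summary against human references with ROUGE-1/2, then test per-topic score differences between two systems for significance) can be sketched as below. This is not the authors' code or the official ROUGE toolkit: it is a minimal illustration assuming whitespace-tokenized summaries, and the paired bootstrap helper is one reasonable stand-in for whichever paired significance test is preferred.

```python
import random
from collections import Counter


def ngrams(tokens, n):
    """Return a Counter of n-grams from a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def rouge_n_recall(candidate, references, n):
    """Average ROUGE-N recall of `candidate` against each reference.

    Inputs are plain word lists; the official ROUGE toolkit additionally
    applies stemming options and jackknifing over references, which this
    sketch omits.
    """
    cand_counts = ngrams(candidate, n)
    recalls = []
    for ref in references:
        ref_counts = ngrams(ref, n)
        total = sum(ref_counts.values())
        matched = sum(min(c, cand_counts[g]) for g, c in ref_counts.items())
        recalls.append(matched / total if total else 0.0)
    return sum(recalls) / len(recalls) if recalls else 0.0


def paired_bootstrap(scores_a, scores_b, trials=10000, seed=0):
    """Paired bootstrap over per-topic scores of two systems.

    Resamples topics with replacement and counts how often system A's mean
    score exceeds system B's; returns an approximate one-sided p-value for
    "A is not better than B". Other paired tests (e.g. Wilcoxon) are
    equally valid choices here.
    """
    assert len(scores_a) == len(scores_b)
    rng = random.Random(seed)
    n = len(scores_a)
    wins = 0
    for _ in range(trials):
        idx = [rng.randrange(n) for _ in range(n)]
        if sum(scores_a[i] for i in idx) > sum(scores_b[i] for i in idx):
            wins += 1
    return 1.0 - wins / trials
```

In this setup, `scores_a` and `scores_b` would hold per-topic ROUGE-1 or ROUGE-2 recall for two systems over the same NIST topics, so the bootstrap respects the pairing by topic.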
Keywords
automatic evaluation,significant difference,accurate automatic measure,reference automatic evaluation metrics,NIST evaluation,manual evaluation,reference evaluation,multi-document summarization,new metrics,summarization system