A Study on Manual and Automatic Evaluation for Text Style Transfer: The Case of Detoxification

PROCEEDINGS OF THE 2ND WORKSHOP ON HUMAN EVALUATION OF NLP SYSTEMS (HUMEVAL 2022)(2022)

引用 0|浏览18
暂无评分
摘要
It is often difficult to reliably evaluate models which generate text. Among them, text style transfer is a particularly difficult to evaluate, because its success depends on a number of parameters. We conduct an evaluation of a large number of models on a detoxification task. We explore the relations between the manual and automatic metrics and find that there is only weak correlation between them, which is dependent on the type of model which generated text. Automatic metrics tend to be less reliable for better-performing models. However, our findings suggest that, ChrF and BertScore metrics can be used as a proxy for human evaluation of text detoxification to some extent.
更多
查看译文
AI 理解论文
溯源树
样例
生成溯源树,研究论文发展脉络
Chat Paper
正在生成论文摘要