Is all that glitters in MT quality estimation really gold standard?

International Conference on Computational Linguistics (2016)

Cited by 22 | Views 12

Abstract
Human-targeted metrics provide a compromise between human evaluation of machine translation, where high inter-annotator agreement is difficult to achieve, and fully automatic metrics, such as BLEU or TER, that lack the validity of human assessment. Human-targeted translation edit rate (HTER) is by far the most widely employed human-targeted metric in machine translation, used, for example, as a gold standard in the evaluation of quality estimation. The original experiments justifying the design of HTER, as opposed to other possible formulations, were limited to a small sample of translations and a single language pair, however, and this motivates our re-evaluation of a range of human-targeted metrics on a substantially larger scale. Results show significantly stronger correlation with human judgment for HBLEU over HTER for two of the nine language pairs we include, and no significant difference between the correlations achieved by HTER and HBLEU for the remaining language pairs. Finally, we evaluate a range of quality estimation systems employing HTER and direct assessment (DA) of translation adequacy as gold labels, which results in diverging system rankings, and propose the use of DA for future quality estimation evaluations.
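As a rough illustration of how the metrics discussed above relate (this sketch is not from the paper): segment-level HTER can be approximated as the word-level edit distance between an MT output and its human post-edited ("targeted") reference, normalised by the post-edit length, and per-segment metric scores can then be correlated with human judgments such as DA adequacy ratings. The TER approximation below ignores block shifts, and all sentences and DA values are made up for illustration.

```python
# Minimal sketch: approximate HTER and correlate it with (hypothetical) DA scores.
# Real TER also permits block shifts; here we use plain word-level Levenshtein distance.

def word_edit_distance(hyp, ref):
    """Levenshtein distance over token lists (insertions, deletions, substitutions)."""
    m, n = len(hyp), len(ref)
    dp = list(range(n + 1))
    for i in range(1, m + 1):
        prev, dp[0] = dp[0], i
        for j in range(1, n + 1):
            cur = dp[j]
            cost = 0 if hyp[i - 1] == ref[j - 1] else 1
            dp[j] = min(dp[j] + 1,      # delete a hypothesis word
                        dp[j - 1] + 1,  # insert a reference word
                        prev + cost)    # substitute (or match)
            prev = cur
    return dp[n]

def hter(mt_output: str, post_edited: str) -> float:
    """Approximate HTER: edits needed to turn the MT output into its post-edit,
    divided by the post-edit length."""
    hyp, ref = mt_output.split(), post_edited.split()
    return word_edit_distance(hyp, ref) / max(len(ref), 1)

def pearson(xs, ys):
    """Pearson correlation, as used to compare metric scores with human judgments."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

# Hypothetical toy segments: (MT output, human post-edit) pairs.
segments = [
    ("the cat sat in mat", "the cat sat on the mat"),
    ("a dog barked loud", "a dog barked loudly"),
    ("translation quality is estimate", "translation quality is estimated"),
]
hter_scores = [hter(mt, pe) for mt, pe in segments]
da_scores = [70.0, 85.0, 84.0]  # made-up adequacy ratings on a 0-100 scale

# HTER is an error rate, so it should correlate negatively with adequacy.
print(hter_scores)
print(pearson(hter_scores, da_scores))
```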