PONE: A Novel Automatic Evaluation Metric for Open-domain Generative Dialogue Systems

ACM Transactions on Information Systems (2020)

Cited 42 | Views 160
Abstract
Open-domain generative dialogue systems have attracted considerable attention over the past few years. However, automatically evaluating them remains a major challenge. To the best of our knowledge, there are three kinds of automatic evaluation metrics for open-domain generative dialogue systems: (1) word-overlap-based metrics; (2) embedding-based metrics; (3) learning-based metrics. Due to the lack of systematic comparison, it is not clear which kind of metric is most effective. In this article, we first systematically compare all three kinds of metrics. Extensive experiments demonstrate that learning-based metrics are the most effective for evaluating open-domain generative dialogue systems. Moreover, we observe that nearly all learning-based metrics depend on a negative sampling mechanism that yields extremely imbalanced and low-quality samples for training a score model. To address this issue, we propose PONE, a novel learning-based metric that significantly improves the correlation with human judgments by using augmented POsitive samples and valuable NEgative samples. Extensive experiments demonstrate that PONE significantly outperforms the state-of-the-art learning-based evaluation method. In addition, we have publicly released the code for our proposed metric and the state-of-the-art baselines.
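To make the critique concrete, below is a minimal sketch of the negative-sampling setup that the abstract describes for learning-based metrics: a score model is trained to separate true (context, response) pairs from contexts paired with randomly drawn responses. Everything here is an illustrative assumption (PyTorch, random vectors standing in for sentence encodings, a hypothetical ScoreModel class), not PONE's actual architecture, which is detailed in the paper.

```python
import random
import torch
import torch.nn as nn

# Toy illustration of the negative-sampling setup that learning-based
# metrics commonly use: true (context, response) pairs are positives,
# and randomly re-paired responses are negatives. All names and sizes
# are illustrative assumptions, not the paper's actual model.

class ScoreModel(nn.Module):
    """Scores a (context, response) pair with a value in (0, 1)."""
    def __init__(self, dim: int = 64):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(2 * dim, dim), nn.ReLU(), nn.Linear(dim, 1)
        )

    def forward(self, ctx_vec, resp_vec):
        return torch.sigmoid(self.mlp(torch.cat([ctx_vec, resp_vec], dim=-1)))

def random_negative(responses, i):
    """Naive negative sampling: pair context i with a response drawn
    uniformly from the rest of the corpus. Such negatives are often
    trivially unrelated, i.e. the low-quality samples the paper targets."""
    j = random.choice([k for k in range(len(responses)) if k != i])
    return responses[j]

# Dummy encoded corpus: one vector per context/response, standing in
# for sentence-encoder outputs.
dim, n = 64, 100
contexts = torch.randn(n, dim)
responses = torch.randn(n, dim)

model = ScoreModel(dim)
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.BCELoss()

for i in range(n):
    pos = model(contexts[i], responses[i])
    neg = model(contexts[i], random_negative(responses, i))
    # One positive vs. one random negative; real setups are often far
    # more imbalanced, which is the weakness PONE addresses with
    # augmented positives and more valuable (harder) negatives.
    loss = loss_fn(pos, torch.ones(1)) + loss_fn(neg, torch.zeros(1))
    opt.zero_grad()
    loss.backward()
    opt.step()
```

The sketch also shows why random negatives are weak training signal: most randomly drawn responses are obviously unrelated to the context, so the score model rarely sees the hard cases that matter for correlating with human judgments.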
Keywords
Open-domain, automatic evaluation, generative dialogue systems