We Need to Talk About Classification Evaluation Metrics in NLP
CoRR(2024)
摘要
In Natural Language Processing (NLP) classification tasks such as topic
categorisation and sentiment analysis, model generalizability is generally
measured with standard metrics such as Accuracy, F-Measure, or AUC-ROC. The
diversity of metrics, and the arbitrariness of their application suggest that
there is no agreement within NLP on a single best metric to use. This lack
suggests there has not been sufficient examination of the underlying heuristics
which each metric encodes. To address this we compare several standard
classification metrics with more 'exotic' metrics and demonstrate that a
random-guess normalised Informedness metric is a parsimonious baseline for task
performance. To show how important the choice of metric is, we perform
extensive experiments on a wide range of NLP tasks including a synthetic
scenario, natural language understanding, question answering and machine
translation. Across these tasks we use a superset of metrics to rank models and
find that Informedness best captures the ideal model characteristics. Finally,
we release a Python implementation of Informedness following the SciKitLearn
classifier format.
更多查看译文
AI 理解论文
溯源树
样例
生成溯源树,研究论文发展脉络
Chat Paper
正在生成论文摘要