Benchmarking benchmarks: introducing new automatic indicators for benchmarking Spoken Language Understanding corpora

INTERSPEECH (2019)

Abstract
Empirical evaluation is nowadays the main paradigm in Natural Language Processing for assessing the relevance of a new machine-learning based model. While large corpora are available for tasks such as Automatic Speech Recognition, this is not the case for other tasks such as Spoken Language Understanding (SLU), which consists of translating spoken transcriptions into a formal representation, often based on semantic frames. Corpora such as ATIS or SNIPS are widely used to compare systems; however, differences in performance among systems are often very small, not statistically significant, and can be produced by biases in the data collection or the annotation scheme, as we showed on the ATIS corpus ("Is ATIS too shallow?", IS2018). In this study we propose a new methodology for assessing the relevance of an SLU corpus. We claim that taking into account only system performance does not provide enough insight into what is covered by current state-of-the-art models and what is left to be done. We apply our methodology to a set of 4 SLU systems and 5 benchmark corpora (ATIS, SNIPS, M2M, MEDIA) and automatically produce several indicators assessing the relevance (or not) of each corpus for benchmarking SLU models.
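The abstract does not spell out the indicators themselves, but one simple corpus-level indicator of the kind it describes could be the proportion of test utterances whose semantic frame (intent plus slot types) already occurs verbatim in the training data: high overlap would suggest a benchmark rewards memorisation rather than generalisation. The sketch below is a hypothetical illustration of such an indicator, not the paper's actual method; the data format and the `frame_signature` / `frame_overlap` helpers are assumptions.

```python
# Hypothetical indicator sketch: train/test overlap of semantic frames.
# Assumed data format: each example is a dict with an "intent" string and a
# list of "slots", each slot carrying a "type" string. Not the paper's method.
from collections import Counter


def frame_signature(example):
    """Reduce an annotated utterance to its semantic frame: intent + sorted slot types."""
    return (example["intent"], tuple(sorted(slot["type"] for slot in example["slots"])))


def frame_overlap(train, test):
    """Fraction of test examples whose frame signature was already seen in training."""
    seen = Counter(frame_signature(ex) for ex in train)
    if not test:
        return 0.0
    hits = sum(1 for ex in test if frame_signature(ex) in seen)
    return hits / len(test)


if __name__ == "__main__":
    # Toy ATIS-like examples, invented purely for this sketch.
    train = [
        {"intent": "flight", "slots": [{"type": "fromloc"}, {"type": "toloc"}]},
        {"intent": "airfare", "slots": [{"type": "toloc"}]},
    ]
    test = [
        {"intent": "flight", "slots": [{"type": "fromloc"}, {"type": "toloc"}]},
        {"intent": "ground_service", "slots": [{"type": "city_name"}]},
    ]
    print(f"frame overlap: {frame_overlap(train, test):.2f}")  # prints 0.50
```

An indicator like this could be computed automatically for each corpus and reported alongside system scores, which is in the spirit of the paper's claim that raw performance numbers alone do not say how much of a benchmark is already covered.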
Keywords
benchmarks, language, new automatic indicators