Bootstrapping a Text Normalization System for an Inflected Language. Numbers as a Test Case

INTERSPEECH (2019)

Abstract
Text normalization is an important part of many natural language applications, particularly text-to-speech systems. It poses special challenges for highly inflected languages, since the correct morphological form for the normalization is not evident from the non-standard word, e.g. a digit. In this paper we report on ongoing work on a text normalization system for Icelandic, a highly inflected North Germanic language. We describe experiments on the normalization of numbers and address the problem of choosing the correct morphological form of number names. We use language models (LMs) trained on texts containing number names and inspect the effects of different LMs on domain-specific texts with a high ratio of digits. A partially class-based LM, which replaces number names with their part-of-speech tags, shows the best results in all domains. We further show that testing normalization on texts where number names have been converted to digits does not give representative results for texts originally containing digits: while a test set similar to the language model training data shows an error rate of 10.1% on inflected cardinals from 1 to 99, test sets originally containing digits show error rates of 45.3% and 55%.
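To make the core idea concrete, the following is a minimal sketch of choosing an inflected number name with a language model. It is not the authors' system. It assumes a toy three-sentence corpus, a word-level bigram model with add-one smoothing, and hypothetical helper names (`train_bigrams`, `log_prob`, `normalize`). The candidate list holds the standard inflected forms of the Icelandic cardinal 2. The partially class-based variant described in the abstract, which replaces number names with part-of-speech tags, is not implemented here.

```python
import math
from collections import Counter

# Toy training text (illustrative assumption, not the paper's data).
CORPUS = [
    "tveir menn komu",   # "two men came" (masculine)
    "tvær konur komu",   # "two women came" (feminine)
    "tvö börn komu",     # "two children came" (neuter)
]

# Inflected forms of the Icelandic cardinal 2 across gender and case.
CANDIDATES = {"2": ["tveir", "tvær", "tvö", "tvo", "tveimur", "tveggja"]}


def train_bigrams(sentences):
    """Count unigrams and bigrams over sentences padded with boundary tokens."""
    unigrams, bigrams = Counter(), Counter()
    for sentence in sentences:
        tokens = ["<s>"] + sentence.split() + ["</s>"]
        unigrams.update(tokens)
        bigrams.update(zip(tokens, tokens[1:]))
    return unigrams, bigrams


def log_prob(prev, word, unigrams, bigrams):
    """Add-one smoothed log P(word | prev)."""
    vocab_size = len(unigrams)
    return math.log((bigrams[(prev, word)] + 1) / (unigrams[prev] + vocab_size))


def normalize(sentence, unigrams, bigrams):
    """Replace each known digit token with the candidate form that scores
    highest given its left and right neighbours."""
    tokens = ["<s>"] + sentence.split() + ["</s>"]
    out = list(tokens)
    for i, tok in enumerate(tokens):
        if tok in CANDIDATES:
            out[i] = max(
                CANDIDATES[tok],
                key=lambda c: log_prob(out[i - 1], c, unigrams, bigrams)
                + log_prob(c, tokens[i + 1], unigrams, bigrams),
            )
    return " ".join(out[1:-1])


UNIGRAMS, BIGRAMS = train_bigrams(CORPUS)
print(normalize("2 konur komu", UNIGRAMS, BIGRAMS))
```

With this toy corpus the digit is expanded to agree with the following noun, so `2 konur komu` becomes `tvær konur komu`. A word-level model like this only works for contexts it has seen, which is one motivation for the class-based approach the abstract reports as performing best.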
Keywords
text normalization, inflected languages, Icelandic