Building Large-Vocabulary ASR Systems for Languages Without Any Audio Training Data

INTERSPEECH (2019)

Abstract
When building automatic speech recognition (ASR) systems, some amount of audio and text data in the target language is typically needed. While text data can be obtained relatively easily for many languages, transcribed audio data is challenging to obtain. This is a barrier to making voice technologies available in more of the world's languages. In this paper, we present a way to build an ASR system for a language even when no audio training data in that language exists at all. We do this by simply re-using an existing acoustic model from a phonologically similar language, without any modification or adaptation towards the target language. The basic insight is that, if two languages have sufficiently similar phonological systems, an acoustic model trained on one should hold up relatively well when applied to the other. We describe how we tailor our pronunciation models to enable such re-use, and show experimental results across a number of languages from various language families. We also provide a theoretical analysis of the situations in which this approach is likely to work. Our results show that it is possible to achieve a word error rate (WER) below 20% using this method.
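The abstract does not spell out how the pronunciation models are tailored. The following is a minimal sketch of one plausible approach, assuming a phone-mapping strategy: any target-language phone missing from the source acoustic model's inventory is replaced by the source phone with the most similar articulatory features, so every lexicon entry can be scored by the unmodified source model. The feature sets, phone inventories, toy words, and function names (`nearest_phone`, `remap_lexicon`, `inventory_coverage`) are all hypothetical. The coverage ratio is only a crude proxy for phonological similarity and is not the paper's theoretical analysis.

```python
from typing import Dict, FrozenSet, List, Set

# Toy articulatory feature sets (hypothetical, for illustration only).
FEATURES: Dict[str, FrozenSet[str]] = {
    "p": frozenset({"consonant", "bilabial", "stop", "voiceless"}),
    "b": frozenset({"consonant", "bilabial", "stop", "voiced"}),
    "t": frozenset({"consonant", "alveolar", "stop", "voiceless"}),
    "d": frozenset({"consonant", "alveolar", "stop", "voiced"}),
    "a": frozenset({"vowel", "open", "central"}),
    "e": frozenset({"vowel", "mid", "front"}),
    "ɛ": frozenset({"vowel", "mid", "front", "lax"}),
    "i": frozenset({"vowel", "close", "front"}),
}


def similarity(x: str, y: str) -> float:
    """Jaccard overlap between two phones' feature sets."""
    fx, fy = FEATURES[x], FEATURES[y]
    return len(fx & fy) / len(fx | fy)


def nearest_phone(phone: str, source_inventory: Set[str]) -> str:
    """Keep a phone the source model already knows; otherwise pick the most similar one."""
    if phone in source_inventory:
        return phone
    # sorted() makes tie-breaking deterministic across runs.
    return max(sorted(source_inventory), key=lambda s: similarity(phone, s))


def remap_lexicon(
    lexicon: Dict[str, List[str]], source_inventory: Set[str]
) -> Dict[str, List[str]]:
    """Rewrite every pronunciation using only phones from the source inventory."""
    return {
        word: [nearest_phone(p, source_inventory) for p in pron]
        for word, pron in lexicon.items()
    }


def inventory_coverage(target_inventory: Set[str], source_inventory: Set[str]) -> float:
    """Fraction of target phones that exist unchanged in the source inventory."""
    return len(target_inventory & source_inventory) / len(target_inventory)


if __name__ == "__main__":
    source = {"p", "t", "a", "e", "i"}  # hypothetical source-language phones
    target = {"b", "d", "a", "ɛ", "i"}  # hypothetical target-language phones
    lexicon = {"bɛd": ["b", "ɛ", "d"], "dia": ["d", "i", "a"]}  # toy target words
    print(remap_lexicon(lexicon, source))  # {'bɛd': ['p', 'e', 't'], 'dia': ['t', 'i', 'a']}
    print(inventory_coverage(target, source))  # 0.4
```

In this toy setup the voiced stops collapse onto their voiceless counterparts and /ɛ/ onto /e/, which illustrates why a high coverage ratio (few substitutions) would be expected to favor re-use.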