Cross-dialect lexicon optimisation for an endangered language ASR system: the case of Irish

Conference of the International Speech Communication Association (INTERSPEECH), 2022

Abstract
Lexicon optimisation strategies that address the problem of dialect divergence are tested in an ASR system for Irish. Like many endangered languages, Irish has no spoken standard but rather three very different dialects: Ulster (Ul), Connaught (Co) and Munster (Mu). Furthermore, the complex sound system and the ancient, opaque writing system result in sound-to-grapheme mappings that differ considerably across dialects. A hybrid ASR system was trained on (predominantly) native-speaker speech data, balanced across the dialects. Experiment 1 tested whether a Global lexicon, which captures dialect variant forms with relatively abstract representations, can perform as well as a Multi-dialect lexicon containing all dialect variants. Three dialect-specific lexicons were also included in the tests. The Global lexicon yielded the best performance, and Experiment 2 tested whether further reductions to its phoneset might enhance its performance further. These reductions were (i) merging a Tense-Lax contrast among coronal sonorants, which is not common to all dialects, and (ii) merging the voiceless-voiced contrast among sonorants, as the voiceless member is relatively infrequent. Results showed only a slight enhancement, and only for the Mu dialect, which is the one most aligned with the phoneset reduction.
Keywords
Irish, speech recognition, cross-dialect variation, lexicon, minority language
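
The phoneset reduction described in the abstract (merging a Tense-Lax contrast among coronal sonorants and a voiceless-voiced contrast among sonorants) can be pictured as a symbol-level rewrite of a pronunciation lexicon. The following is only a minimal sketch under stated assumptions, not the authors' implementation: the phone symbols (`N_tense`, `l_vl`, and so on), the toy entries, the direction of each merge, and the function names `reduce_phoneset` and `phoneset` are all hypothetical and chosen for illustration.

```python
# Hypothetical merge table: the paper's actual phone symbols are not given
# in the abstract, so these names are placeholders.
MERGE_MAP = {
    "N_tense": "n",  # assumed: tense coronal nasal merged into its lax counterpart
    "L_tense": "l",  # assumed: tense coronal lateral merged into its lax counterpart
    "n_vl": "n",     # assumed: voiceless sonorant merged into its voiced counterpart
    "l_vl": "l",
    "r_vl": "r",
}


def reduce_phoneset(lexicon, merge_map):
    """Return a new lexicon with phones rewritten according to merge_map.

    Pronunciations that become identical after merging are kept only once,
    in their original order of appearance.
    """
    reduced = {}
    for word, pronunciations in lexicon.items():
        kept = []
        for pron in pronunciations:
            merged = tuple(merge_map.get(phone, phone) for phone in pron)
            if merged not in kept:
                kept.append(merged)
        reduced[word] = kept
    return reduced


def phoneset(lexicon):
    """Collect the set of distinct phone symbols used in a lexicon."""
    return {
        phone
        for pronunciations in lexicon.values()
        for pron in pronunciations
        for phone in pron
    }


# Toy lexicon with placeholder words and pronunciations (not real Irish data).
toy_lexicon = {
    "word_a": [("b", "a", "N_tense"), ("b", "a", "n")],
    "word_b": [("l_vl", "a")],
}

reduced = reduce_phoneset(toy_lexicon, MERGE_MAP)
print(len(phoneset(toy_lexicon)), "phones before,", len(phoneset(reduced)), "after")
```

In this toy case the two variants of `word_a` collapse into one entry after merging, which illustrates how a smaller phoneset can also shrink the number of pronunciation variants in a lexicon.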