Automatic Speech Recognition for Irish: testing lexicons and language models

2022 33rd Irish Signals and Systems Conference (ISSC) (2022)

Abstract
A range of lexicons and language models were tested in the development of automatic speech recognition (ASR) for Irish. One problem, common among minority languages, is the multiplicity of dialects, with no single spoken standard. To address this challenge, two alternative cross-dialect lexicons, both drawing on research in dialect phonology, were tested in a hybrid ASR system. First, individual lexicons were built for the three main dialects: Ulster (Ul), Connaught (Co) and Munster (Mu). From these, a Multi-dialect lexicon was built that incorporates all dialect-varying word forms. An alternative Global lexicon, essentially a trans-dialect lexicon, uses abstract representations (phoneme- or morpheme-sized units) of dialect-varying forms. These two cross-dialect lexicons were tested alongside the three dialect-specific lexicons. Several different language models were also tested. The Global and Multi-dialect lexicons yielded the highest performance, with the Multi-dialect lexicon achieving the lowest overall word error rate (WER). Results for the individual dialect lexicons differed considerably. This may reflect a bias in the datasets used, or it may indicate the linguistic distance between the dialects; these competing hypotheses will need more rigorous testing. The choice of language model had a strong effect on results. Error patterns show frequent substitutions involving inflected forms.
Keywords
Irish, speech recognition, cross-dialect variation, lexicon, language model, minority language
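
The abstract compares lexicons by overall WER. As a reminder of what that figure measures, the sketch below computes the standard word error rate: the minimum number of word substitutions, deletions and insertions needed to turn the recognised text into the reference, divided by the number of reference words. This is a minimal illustration, not the authors' scoring pipeline. The function name `word_error_rate` and the placeholder tokens in the usage example are assumptions made here, and the paper may apply text normalisation that is not shown.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Standard WER: word-level edit distance divided by reference length.

    Hypothetical helper for illustration; tokenisation is plain
    whitespace splitting, with no normalisation applied.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen
    # so far (excluding the current one) and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution_cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,                      # deletion (reference word missed)
                curr[j - 1] + 1,                  # insertion (extra hypothesis word)
                prev[j - 1] + substitution_cost,  # match or substitution
            )
        prev = curr

    return prev[-1] / len(ref)


# Placeholder tokens, not real data: one substitution (b -> x) and one
# deletion (d) against a four-word reference.
example_wer = word_error_rate("a b c d", "a x c")
```

With the placeholder input above, two errors against four reference words give a WER of 0.5. A substitution between two inflected forms of the same word counts as a full error under this measure, which is why the substitution pattern noted in the abstract affects the reported figures directly.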