Some Issues Affecting The Transcription Of Hungarian Broadcast Audio

14TH ANNUAL CONFERENCE OF THE INTERNATIONAL SPEECH COMMUNICATION ASSOCIATION (INTERSPEECH 2013), VOLS 1-5(2013)

引用 27|浏览46
暂无评分
摘要
This paper reports on a speech-to-text (STT) transcription system for Hungarian broadcast audio developed for the 2012 Quaero evaluations. For this evaluation, no manually transcribed audio data were provided for model training, however a small amount of development data were provided to assess system performance. As a consequence, the acoustic models were developed in an unsupervised manner, with the only supervision provided indirectly by the language model. The language models were trained on texts downloaded from various web sites, also without any speech transcripts. This contrasts with other STT systems for Hungarian broadcast audio which use at least 10 to 50 hours of manually transcribed data for acoustic training, and typically include speech transcripts in the language models. Based on mixed results previously reported applying morph-based approaches to agglutinative languages such as Hungarian, word-based language models were used. The initial Word Error Rate (WER) of the system using context independent seed models from other languages of 59.8% on the 3h development corpus was reduced to 25.0% after successive training iterations and system refinement. The same system obtained a WER of 23.3% on the independent Quaero 2012 evaluation corpus (a mix of broadcast news and broadcast conversation data). These results compare well with previously reported systems on similar data. Various issues affecting system performance are discussed, such as amount of training data, the acoustic features and choice of text sources for language model training.
更多
查看译文
关键词
Large vocabulary continuous speech recognition (LVCSR),broadcast news transcription,Hungarian language,unsupervised training,agglutinative languages,Bottleneck MLP features
AI 理解论文
溯源树
样例
生成溯源树,研究论文发展脉络
Chat Paper
正在生成论文摘要