Unsupervised Training On A Large Amount Of Arabic Broadcast News Data

ICASSP (2)(2007)

Abstract
The unsupervised training we carried out on 1,858 hours of untranscribed Arabic Broadcast News (BN) data yields a sizable gain. However, this gain is only about half of that achieved on 1,900 hours of English BN data. This paper presents our efforts to enlarge the gain on the Arabic data. These efforts include the design of an explicit hypothesis-confidence-estimating method for data selection, the use of new features and neural networks (NN) to improve hypothesis-confidence estimation, and alleviation of the over-fitting problem in the estimation. Our experiments show that both the explicit hypothesis-confidence-estimating method and the new features improve the estimation and yield additional gains from unsupervised training; the use of neural networks does not significantly improve the confidence estimation; and the alleviation of the over-fitting problem is not significant enough to decrease the word error rate (WER). This paper also presents improvements from unsupervised training on a morpheme-based Arabic system and on models trained with the maximum mutual information (MMI) criterion.
Keywords
speech recognition, unsupervised training, confidence estimation, Arabic broadcast news
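
The confidence-based data selection mentioned in the abstract can be illustrated with a minimal sketch. It assumes that each automatically transcribed utterance already carries a confidence score in [0, 1] and that utterances at or above a fixed threshold are kept for training. The `Hypothesis` class, the `select_for_training` function, the 0.8 threshold, and the sample data are all hypothetical and are not taken from the paper, which does not describe its selection rule at this level of detail.

```python
from dataclasses import dataclass


@dataclass
class Hypothesis:
    """One automatically transcribed utterance (hypothetical structure)."""
    utt_id: str
    text: str
    confidence: float   # estimated probability that the hypothesis is correct
    duration_s: float   # audio duration in seconds


def select_for_training(hyps, threshold=0.8):
    """Keep hypotheses whose confidence is at least `threshold`.

    The threshold value is an illustrative assumption, not a value
    reported in the paper.
    """
    return [h for h in hyps if h.confidence >= threshold]


# Made-up example data, for illustration only.
hyps = [
    Hypothesis("utt1", "first recognized sentence", 0.93, 6.0),
    Hypothesis("utt2", "second recognized sentence", 0.41, 4.5),
    Hypothesis("utt3", "third recognized sentence", 0.80, 9.5),
]

selected = select_for_training(hyps, threshold=0.8)
selected_seconds = sum(h.duration_s for h in selected)
```

Raising the threshold trades training data volume for transcript accuracy, which is why the quality of the confidence estimate matters for the final gain.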