CTC Training of Multi-Phone Acoustic Models for Speech Recognition

18th Annual Conference of the International Speech Communication Association (INTERSPEECH 2017), Vols 1-6: Situated Interaction (2017)

Abstract
Phone-sized acoustic units such as triphones cannot properly capture the long-term co-articulation effects that occur in spontaneous speech. For that reason, it is interesting to construct acoustic units covering a longer time-span, such as syllables or words. Unfortunately, the frequency distribution of those units is such that a few high-frequency units account for most of the tokens, while many units rarely occur. As a result, those units suffer from data sparsity and can be difficult to train. In this paper we propose a scalable data-driven approach to construct a set of salient units made of sequences of phones, called M-phones. We illustrate that since the decomposition of a word sequence into a sequence of M-phones is ambiguous, those units are well suited to be used with a connectionist temporal classification (CTC) approach, which does not rely on an explicit frame-level segmentation of the word sequence into a sequence of acoustic units. Experiments are presented on a Voice Search task using 12,500 hours of training data.
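As an illustration of the decomposition ambiguity the abstract refers to, the sketch below enumerates every way to segment a phone sequence into multi-phone units of bounded length. This is a hypothetical toy example (it assumes an inventory containing every unit up to `max_len` phones), not the paper's actual data-driven M-phone selection; it only shows why a single pronunciation admits many unit-level segmentations, which is the property that makes CTC, with its implicit alignment, a natural fit.

```python
def decompositions(phones, max_len=2):
    """Enumerate all segmentations of a phone sequence into
    multi-phone units of at most max_len phones.

    Assumes (hypothetically) that every phone sequence of length
    <= max_len is a valid unit in the inventory.
    """
    if not phones:
        return [[]]
    results = []
    for k in range(1, min(max_len, len(phones)) + 1):
        unit = tuple(phones[:k])  # candidate leading unit
        for rest in decompositions(phones[k:], max_len):
            results.append([unit] + rest)
    return results

# A 5-phone word with units of length <= 2 already has 8 segmentations.
segs = decompositions(['h', 'e', 'l', 'l', 'o'], max_len=2)
print(len(segs))  # → 8
```

With larger unit lengths the number of segmentations grows quickly, which is why a training criterion that marginalizes over alignments, as CTC does, is preferable to committing to one fixed segmentation.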
Keywords
acoustic modeling, CTC, multi-phone units, pronunciation modeling