An E2E-ASR-Based Iteratively-Trained Timestamp Estimator

IEEE SIGNAL PROCESSING LETTERS(2022)

Abstract
Text-to-speech alignment, also known as time alignment, is essential for automatic speech recognition (ASR) systems used for speech retrieval tasks, such as keyword search and speech segment extraction. Previous works have used the Gaussian mixture model-hidden Markov model (GMM-HMM) forced alignment to improve the alignment performance. However, when used with end-to-end (E2E) ASR, GMM-HMM forced alignment causes extra reliance on expertise such as pronunciation lexica. It also increases the system complexity because GMM-HMMs are very dissimilar to E2E models. To tackle these two problems, we propose an E2E-ASR-based iteratively-trained timestamp estimator (ITSE), which performs alignment between token-level transcription and speech. We train ITSE first with coarse initial alignment targets generated using connectionist temporal classification (CTC) posteriors. During training, we iteratively perform realignment to update the targets. We attribute the effectiveness of the iterative training to ITSE's two vital features. First, ITSE performs alignment using similarities between token and speech embeddings instead of frame-wise token classification posteriors. Second, ITSE uses speech embeddings that are aware of left context rather than global context. ITSE significantly outperforms CTC-based baselines in word alignment accuracy and is comparable to a GMM-HMM forced aligner. In short, ITSE is an accurate, lightweight text-to-speech alignment module implemented without expertise such as pronunciation lexica.
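As a rough illustration of alignment driven by similarities between token and speech embeddings, the sketch below runs a generic Viterbi-style monotonic alignment over a frame-by-token similarity matrix. It is not the ITSE procedure itself, whose details the abstract does not give. The function name `monotonic_align`, the constraint that every token receives at least one frame, and the toy similarity values are all assumptions made for the example.

```python
import numpy as np

def monotonic_align(sim):
    """Align T frames to N tokens monotonically, maximizing total similarity.

    sim: array of shape (T, N), similarity between frame t and token n
         (for example, a dot product of speech and token embeddings).
    Returns one (start, end) frame span per token, with end exclusive.
    Assumes every token covers at least one frame (illustrative choice).
    """
    T, N = sim.shape
    if T < N:
        raise ValueError("need at least one frame per token")

    score = np.full((T, N), -np.inf)
    back = np.zeros((T, N), dtype=int)  # 0 = stay on token, 1 = advance
    score[0, 0] = sim[0, 0]

    for t in range(1, T):
        for n in range(min(t + 1, N)):
            stay = score[t - 1, n]
            move = score[t - 1, n - 1] if n > 0 else -np.inf
            if move > stay:
                score[t, n] = move + sim[t, n]
                back[t, n] = 1
            else:
                score[t, n] = stay + sim[t, n]
                back[t, n] = 0

    # Trace back from the last frame of the last token.
    n = N - 1
    frame_to_token = np.zeros(T, dtype=int)
    for t in range(T - 1, -1, -1):
        frame_to_token[t] = n
        n -= back[t, n]

    spans = []
    for k in range(N):
        idx = np.flatnonzero(frame_to_token == k)
        spans.append((int(idx[0]), int(idx[-1]) + 1))
    return spans

# Toy similarity matrix: 6 frames, 3 tokens (made-up values).
toy_sim = np.array([
    [1, 0, 0],
    [1, 0, 0],
    [0, 1, 0],
    [0, 1, 0],
    [0, 1, 0],
    [0, 0, 1],
], dtype=float)
print(monotonic_align(toy_sim))  # → [(0, 2), (2, 5), (5, 6)]
```

Multiplying each span boundary by the frame shift would give token timestamps in seconds. In an iterative scheme like the one the abstract describes, spans of this kind could serve as updated training targets after each realignment pass, though the abstract does not specify how the realignment is carried out.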
Keywords
Training,Hidden Markov models,Acoustics,Task analysis,Electronic mail,Decoding,Neural networks,Automatic speech recognition,end-to-end,text-to-speech alignment