Less is More: Improved RNN-T Decoding Using Limited Label Context and Path Merging

2021 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP 2021)(2021)

引用 34|浏览88
暂无评分
摘要
End-to-end models that condition the output sequence on all previously predicted labels have emerged as popular alternatives to conventional systems for automatic speech recognition (ASR). Since distinct label histories correspond to distinct models states, such models are decoded using an approximate beam-search which produces a tree of hypotheses.In this work, we study the influence of the amount of label context on the model’s accuracy, and its impact on the efficiency of the decoding process. We find that we can limit the context of the recurrent neural network transducer (RNN-T) during training to just four previous word-piece labels, without degrading word error rate (WER) relative to the full-context baseline. Limiting context also provides opportunities to improve decoding efficiency by removing redundant paths from the active beam, and instead retaining them in the final lattice. This path-merging scheme can also be applied when decoding the baseline full-context model through an approximation. Overall, we find that the proposed path-merging scheme is extremely effective, allowing us to improve oracle WERs by up to 36% over the baseline, while simultaneously reducing the number of model evaluations by up to 5.3% without any degradation in WER, or up to 15.7% when lattice rescoring is applied.
更多
查看译文
关键词
end-to-end models, RNN-T, limited context, lattices, beam search
AI 理解论文
溯源树
样例
生成溯源树,研究论文发展脉络
Chat Paper
正在生成论文摘要