Reducing the computational complexity for whole word models

2017 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU), 2017

Abstract
In a previous study, we demonstrated the feasibility of building a competitive, greatly simplified, large vocabulary continuous speech recognition system with whole words as acoustic units. In that system, we model about 100,000 words directly using deep bi-directional LSTM RNNs. To alleviate the data sparsity problem for word models, we train the model on 125,000 hours of semi-supervised acoustic training data. The resulting model works very well as an end-to-end all-neural speech recognition model without the use of any language model, removing the need for decoding. However, the very large output layer increases the computational cost substantially. In this work we address this issue by adding TDNN (Time Delay Neural Network) layers that reduce the frame rate to 120 ms for the output layer. The TDNN layers are interspersed with the LSTM layers, gradually reducing the frame rate from 10 ms to 120 ms. The new model reduces the computational cost by 60% while improving the word error rate by 6% relative. Compared to a traditional LVCSR system, the whole word speech recognizer uses about the same CPU cycles and can easily be parallelized across CPU cores or run on GPUs.
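The following is a minimal sketch (in PyTorch, not the authors' code) of the kind of architecture the abstract describes: bidirectional LSTM layers interleaved with strided TDNN layers, realised here as 1-D convolutions, that reduce the frame rate from 10 ms to 120 ms before a large whole-word softmax output layer. All layer sizes, kernel widths, and the placement of the strides are illustrative assumptions; only the overall 12x frame-rate reduction and the roughly 100,000-word output layer come from the abstract.

```python
# Illustrative sketch of a whole-word BLSTM model with interspersed strided
# TDNN layers that subsample the frame rate 10 ms -> 120 ms (assumed layout).
import torch
import torch.nn as nn


class WholeWordLstmTdnn(nn.Module):
    def __init__(self, feat_dim=80, hidden=512, vocab_size=100_000):
        super().__init__()
        # BLSTM over 10 ms input frames.
        self.lstm1 = nn.LSTM(feat_dim, hidden, bidirectional=True, batch_first=True)
        # TDNN as a strided 1-D convolution: stride 2 halves the frame rate (10 ms -> 20 ms).
        self.tdnn1 = nn.Conv1d(2 * hidden, 2 * hidden, kernel_size=3, stride=2, padding=1)
        self.lstm2 = nn.LSTM(2 * hidden, hidden, bidirectional=True, batch_first=True)
        # 20 ms -> 60 ms.
        self.tdnn2 = nn.Conv1d(2 * hidden, 2 * hidden, kernel_size=3, stride=3, padding=1)
        self.lstm3 = nn.LSTM(2 * hidden, hidden, bidirectional=True, batch_first=True)
        # 60 ms -> 120 ms; the output layer now runs 12x less often than the input.
        self.tdnn3 = nn.Conv1d(2 * hidden, 2 * hidden, kernel_size=3, stride=2, padding=1)
        # Whole-word output layer: one logit per word in the ~100k vocabulary.
        self.out = nn.Linear(2 * hidden, vocab_size)

    def forward(self, x):                        # x: (batch, frames, feat_dim)
        x, _ = self.lstm1(x)
        x = self.tdnn1(x.transpose(1, 2)).transpose(1, 2)
        x, _ = self.lstm2(x)
        x = self.tdnn2(x.transpose(1, 2)).transpose(1, 2)
        x, _ = self.lstm3(x)
        x = self.tdnn3(x.transpose(1, 2)).transpose(1, 2)
        return self.out(x)                       # (batch, frames / 12, vocab_size)


logits = WholeWordLstmTdnn()(torch.randn(2, 240, 80))  # 240 x 10 ms ≈ 2.4 s of audio
print(logits.shape)                                    # torch.Size([2, 20, 100000])
```

The point of the interspersed strided layers is that the expensive 100,000-way output projection is evaluated only once per 120 ms rather than once per 10 ms frame, which is the source of the reported reduction in computational cost.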
Keywords
computational complexity, whole word models, large vocabulary continuous speech recognition, acoustic units, data sparsity, semi-supervised acoustic training data, time delay neural network (TDNN), deep bidirectional LSTM RNNs, end-to-end all-neural speech recognition, word error rate, LSTM layers, TDNN layers, frame rate reduction