Effective Joint Training Of Denoising Feature Space Transforms And Neural Network Based Acoustic Models

2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2017

Abstract
Neural network (NN) based acoustic frontends, such as denoising autoencoders, are actively being investigated to improve the robustness of NN-based acoustic models to various noise conditions. Recent work has shown that jointly training such frontends with backend NNs significantly improves speech recognition performance. In this paper, we propose an effective algorithm to jointly train such a denoising feature space transform and an NN-based acoustic model with various kinds of data. Our proposed method first pretrains a convolutional neural network (CNN) based denoising frontend and then jointly trains this frontend with an NN backend acoustic model. In the unsupervised pretraining stage, the frontend is trained to estimate clean log Mel-filterbank features from noisy log-power spectral input features. A subsequent multi-stage training of the proposed frontend, with the dropout technique applied only at the joint layer between the frontend and backend NNs, leads to significant improvements in overall performance. On the Aurora-4 task, our proposed system achieves an average WER of 9.98%, a 9.0% relative improvement over one of the best reported speaker-independent baseline systems. A final semi-supervised adaptation of the frontend NN, similar to feature space adaptation, reduces the average WER to 7.39%, a further 25% relative WER improvement.
Keywords
Speech recognition, neural network, CNN, joint training, denoising autoencoder
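To make the architecture described in the abstract concrete, the following is a minimal sketch of the joint frontend/backend forward pass, with dropout applied only at the joint layer between the two networks. All dimensions and the use of a single affine layer in place of the paper's pretrained CNN frontend are assumptions for illustration; the paper does not fully specify these in the abstract.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical dimensions (not taken from the paper):
N_SPEC = 257    # noisy log-power spectral input bins
N_MEL = 40      # clean log Mel-filterbank targets of the frontend
N_HID = 128     # backend hidden units
N_STATES = 10   # toy number of acoustic-model output states

def relu(x):
    return np.maximum(x, 0.0)

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

# Frontend: denoising feature-space transform, pretrained to map noisy
# log-power spectra to clean log-Mel features (a single affine layer
# stands in for the paper's CNN here).
W_f = rng.normal(0, 0.01, (N_SPEC, N_MEL))
b_f = np.zeros(N_MEL)

# Backend: NN acoustic model operating on the frontend's output.
W_1 = rng.normal(0, 0.01, (N_MEL, N_HID))
b_1 = np.zeros(N_HID)
W_o = rng.normal(0, 0.01, (N_HID, N_STATES))
b_o = np.zeros(N_STATES)

def forward(x, train=True, p_drop=0.5):
    """Joint forward pass: dropout is applied ONLY at the joint layer
    between frontend and backend, as the abstract describes."""
    mel_hat = x @ W_f + b_f              # estimated clean log-Mel features
    if train:                            # inverted dropout at the joint layer
        mask = (rng.random(mel_hat.shape) > p_drop) / (1.0 - p_drop)
        mel_hat = mel_hat * mask
    h = relu(mel_hat @ W_1 + b_1)        # backend hidden layer
    return softmax(h @ W_o + b_o)        # per-frame state posteriors

batch = rng.normal(size=(4, N_SPEC))     # 4 noisy input frames
post = forward(batch, train=True)        # shape (4, N_STATES), rows sum to 1
```

Because dropout is confined to the joint layer, the pretrained frontend and the backend can each be trained with their usual recipes, while the randomized junction discourages the backend from over-relying on any single enhanced feature dimension during joint fine-tuning.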