Far-Field Speech Enhancement Using Heteroscedastic Autoencoder for Improved Speech Recognition

INTERSPEECH (2019)

Abstract
Automatic speech recognition (ASR) systems trained on clean speech do not perform well in far-field scenarios. Degradation in word error rate (WER) can be as large as 40% in this mismatched scenario. Typically, speech enhancement is applied to map speech from the far-field condition to the clean condition using a neural network, commonly known as a denoising autoencoder (DA). Such speech enhancement techniques have shown significant improvements in ASR accuracy. It is common practice to train the DA with a mean-square error (MSE) loss, which corresponds to a regression model whose residual noise is a zero-mean Gaussian with constant covariance. However, neither assumption is optimal, especially in highly non-stationary noisy and far-field scenarios. Here, we propose a more general loss based on a non-zero-mean, heteroscedastic-covariance distribution for the residual variables. On top of this, we present several novel DA architectures that are better suited to the heteroscedastic loss. The proposed methods are shown to outperform the conventional DA with MSE loss by a large margin. We observe a relative WER improvement of 7.31% over the conventional DA and, overall, a relative improvement of 14.4% over the mismatched train and test scenario.
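The heteroscedastic loss described in the abstract replaces the fixed-variance Gaussian assumption behind MSE with a per-element variance predicted by the network. A minimal sketch of such a loss in plain Python (the function name, diagonal-covariance form, and mean reduction are illustrative assumptions, not details from the paper):

```python
import math

def heteroscedastic_nll(clean, mean, log_var):
    """Gaussian negative log-likelihood with a predicted (non-constant)
    per-element variance. With log_var fixed at 0 it reduces to 0.5 * MSE,
    i.e. the conventional homoscedastic DA training objective."""
    total = 0.0
    for y, mu, lv in zip(clean, mean, log_var):
        # 0.5 * [log sigma^2 + (y - mu)^2 / sigma^2], constant term dropped
        total += 0.5 * (lv + (y - mu) ** 2 * math.exp(-lv))
    return total / len(clean)

# Example: with all log-variances at 0, the loss is half the MSE.
clean = [1.0, 2.0, 3.0]
mean = [1.1, 1.9, 3.2]
loss = heteroscedastic_nll(clean, mean, [0.0, 0.0, 0.0])
```

Letting the network predict `log_var` lets the loss down-weight frames where residual noise is large (e.g. non-stationary far-field segments) instead of penalizing all errors with one global variance.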
Keywords
distant speech recognition, parallel data, speech enhancement, autoencoder, homoscedastic, heteroscedastic