Audio-visual deep learning for noise robust speech recognition
ICASSP (2013)
Abstract
Deep belief networks (DBNs) have shown impressive improvements over Gaussian mixture models (GMMs) for automatic speech recognition. In this work we apply DBNs to audio-visual speech recognition; in particular, we use deep learning on both audio and visual features to make recognition robust to noise. We test two methods for using DBNs in a multimodal setting: a conventional decision fusion method that combines scores from single-modality DBNs, and a novel feature fusion method that operates on mid-level features learned by the single-modality DBNs. On a continuously spoken digit recognition task, our experiments show that these methods can reduce word error rate by as much as 21% relative to a baseline multi-stream audio-visual GMM/HMM system.
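The abstract does not specify how the two fusion methods are computed, so the following is only a minimal sketch under stated assumptions. It assumes decision fusion is a weighted sum of per-class log scores from the audio and visual models (a common form in multi-stream systems), and that feature fusion starts by concatenating the mid-level features of the two modalities before a joint classifier. The function names, the `audio_weight` parameter, and all numbers are hypothetical illustrations, not values from the paper. The last helper shows how a relative word error rate reduction such as "21% relative" is computed.

```python
def decision_fusion(audio_log_scores, visual_log_scores, audio_weight=0.7):
    """Combine per-class log scores from two single-modality models.

    Assumption: a weighted sum of log scores. The paper's exact
    combination rule is not given in the abstract.
    """
    if len(audio_log_scores) != len(visual_log_scores):
        raise ValueError("both streams must score the same set of classes")
    if not 0.0 <= audio_weight <= 1.0:
        raise ValueError("audio_weight must lie in [0, 1]")
    visual_weight = 1.0 - audio_weight
    return [
        audio_weight * a + visual_weight * v
        for a, v in zip(audio_log_scores, visual_log_scores)
    ]


def feature_fusion(audio_features, visual_features):
    """Concatenate mid-level features from the two modalities.

    Assumption: plain concatenation; a joint classifier (not shown)
    would then be trained on the fused vector.
    """
    return list(audio_features) + list(visual_features)


def relative_wer_reduction(baseline_wer, new_wer):
    """Fraction by which new_wer improves on baseline_wer."""
    if baseline_wer <= 0:
        raise ValueError("baseline_wer must be positive")
    return (baseline_wer - new_wer) / baseline_wer


if __name__ == "__main__":
    # Hypothetical log scores for two classes from each modality.
    fused = decision_fusion([-1.0, -2.0], [-3.0, -0.5], audio_weight=0.5)
    best_class = max(range(len(fused)), key=fused.__getitem__)
    print("fused scores:", fused, "best class:", best_class)

    # Hypothetical WERs chosen only to illustrate the arithmetic.
    print("relative reduction:", relative_wer_reduction(10.0, 7.9))
```

In noisy conditions one would typically lower `audio_weight` so the visual stream contributes more; how the weights are chosen in the paper is not stated in the abstract.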
Keywords
audio visual deep learning,belief networks,feature fusion method,audio-visual speech recognition,speech recognition,word error rate,learning (artificial intelligence),deep belief networks,dbn,gaussian distribution,multistream audio visual gmm/hmm system,noise robust speech recognition,audio visual speech recognition,decision fusion method,noise robustness,gaussian mixture models,hidden markov models,automatic speech recognition,learning artificial intelligence,acoustics,speech,visualization,noise measurement