Improvements to speaker adaptive training of deep neural networks

SLT 2014

Cited by 47 | Views 97
Abstract
Speaker adaptive training (SAT) is a well-studied technique for Gaussian mixture acoustic models (GMMs). Recently, we proposed to perform SAT for deep neural networks (DNNs), with speaker i-vectors applied in feature learning. The resulting SAT-DNN models significantly outperform DNNs on word error rates (WERs). In this paper, we present several methods to further improve and extend SAT-DNN. First, we conduct a detailed analysis of i-vector extractor training and flexible feature fusion. Second, the SAT-DNN approach is extended to improve tasks including bottleneck feature (BNF) generation, convolutional neural network (CNN) acoustic modeling, and multilingual DNN-based feature extraction. Third, for transcribing multimedia data, we enrich the i-vector representation with global speaker attributes (age, gender, etc.) obtained automatically from video signals. On a collection of instructional videos, incorporation of the additional visual features is observed to boost the recognition accuracy of SAT-DNN.
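The core idea of applying speaker i-vectors in feature learning is to append an utterance- or speaker-level i-vector to every acoustic frame before the network. The sketch below illustrates that fusion step with NumPy; all dimensions, the random data, and the toy one-hidden-layer network are illustrative assumptions, not the configuration used in the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical dimensions: 40-dim frame features, 100-dim speaker i-vector.
frame_dim, ivec_dim, hidden_dim, n_senones = 40, 100, 256, 500

frames = rng.standard_normal((300, frame_dim))  # one utterance (300 frames)
i_vector = rng.standard_normal(ivec_dim)        # fixed for this speaker

# Feature fusion: tile the speaker-level i-vector onto every frame,
# so the DNN input becomes [frame features ; i-vector].
fused = np.hstack([frames, np.tile(i_vector, (frames.shape[0], 1))])

# A single hidden layer standing in for the DNN acoustic model.
W1 = rng.standard_normal((frame_dim + ivec_dim, hidden_dim)) * 0.01
W2 = rng.standard_normal((hidden_dim, n_senones)) * 0.01
hidden = np.maximum(fused @ W1, 0.0)  # ReLU
logits = hidden @ W2                  # per-frame senone scores

print(fused.shape, logits.shape)
```

Because the i-vector is constant across an utterance, the fusion amounts to giving the network a speaker-dependent bias at its input, which is what lets the shared DNN weights be trained in a speaker-adaptive fashion.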
Keywords
video signal processing, gaussian mixture acoustic models, deep neural networks, multimedia systems, sat-dnn model, gmm, convolutional neural network acoustic modeling, speech recognition, cnn acoustic modeling, learning (artificial intelligence), speaker i-vectors, feature learning, mixture models, word error rates, speaker recognition, bottleneck feature generation, flexible feature fusion, acoustic signal processing, feature extraction, visual features, instructional videos, gaussian processes, multilingual dnn-based feature extraction, bnf generation, natural language processing, global speaker attributes, speaker adaptive training, i-vector extractor training, multimedia data, neural nets, sensor fusion, video signals, wer