Joint Optimization of Classification and Clustering for Deep Speaker Embedding
2019 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU), 2019
Abstract
This paper proposes a method to train deep speaker embeddings end-to-end by jointly optimizing classification and clustering. A large margin softmax loss is used to reduce classification errors, and a novel large margin Gaussian mixture loss is proposed to improve clustering. With the joint optimization, the learned embeddings capture segment-level acoustic representations from variable-length speech segments that both discriminate between speakers and replicate the densities of speaker clusters. We compare performance with alternative methods on the large-scale text-independent speaker recognition dataset VoxCeleb1 [1] and observe that our method outperforms them significantly, achieving new state-of-the-art results on the dataset. Moreover, because of the joint optimization, this method exhibits faster and better convergence than using the classification loss alone. Our results suggest the great potential of jointly optimizing classification and clustering for speaker verification and identification.
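The joint objective described above combines a classification term (additive margin softmax) with a clustering term (a large margin Gaussian mixture loss). The sketch below is a simplified NumPy illustration of this idea, not the paper's exact formulation: the margin values, the weighting factor `lam`, and the unit-covariance assumption in the Gaussian mixture term are all illustrative choices.

```python
import numpy as np

def am_softmax_loss(emb, W, labels, s=30.0, m=0.35):
    """Additive margin softmax: cosine logits, with the margin m
    subtracted from the target-class cosine before scaling by s."""
    x = emb / np.linalg.norm(emb, axis=1, keepdims=True)   # l2-normalize embeddings
    w = W / np.linalg.norm(W, axis=0, keepdims=True)       # l2-normalize class weights
    cos = x @ w                                            # (N, C) cosine similarities
    idx = np.arange(len(labels))
    logits = s * cos
    logits[idx, labels] = s * (cos[idx, labels] - m)       # margin on the target class
    logits -= logits.max(axis=1, keepdims=True)            # numerical stability
    logp = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -logp[idx, labels].mean()

def gm_loss(emb, means, labels, margin=0.1):
    """Simplified large margin Gaussian mixture loss with unit covariance:
    squared distance to each class mean acts as a negative log-likelihood,
    and the target-class distance is inflated by (1 + margin)."""
    d2 = ((emb[:, None, :] - means[None, :, :]) ** 2).sum(-1)  # (N, C) squared distances
    idx = np.arange(len(labels))
    d2[idx, labels] *= (1.0 + margin)       # enlarging the target distance enforces a margin
    logits = -0.5 * d2
    logits -= logits.max(axis=1, keepdims=True)
    logp = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -logp[idx, labels].mean()

def joint_loss(emb, W, means, labels, lam=0.1):
    """Weighted sum of the classification and clustering terms."""
    return am_softmax_loss(emb, W, labels) + lam * gm_loss(emb, means, labels)
```

In training, `emb` would come from a segment-level pooling layer of the speaker network, while `W` (class weights) and `means` (per-speaker Gaussian means) would be learned jointly with the network parameters by backpropagating through this combined loss.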
Keywords
speaker embedding, multi-task learning, additive margin softmax loss, Gaussian mixture loss, intra-class variation loss