Mixture Representation Learning for Deep Speaker Embedding

IEEE/ACM Transactions on Audio, Speech, and Language Processing (2022)

Abstract
How to effectively convert a variable-length sequence of acoustic features into a fixed-dimension representation has long been a research focus in speaker recognition. In state-of-the-art speaker recognition systems, the conversion is implemented by concatenating the mean and the standard deviation of a sequence of frame-level features. However, a single mean and a single standard deviation are limited descriptive statistics for an acoustic sequence, even with powerful feature extractors such as convolutional neural networks. In this paper, we propose a novel statistics pooling method that produces more descriptive statistics through a mixture representation. Our approach is inspired by the expectation–maximization (EM) algorithm in Gaussian mixture models (GMMs). Instead of using traditional GMM-style alignment, we leverage modern deep learning tools to produce a more powerful mixture representation. The novelty is threefold: (1) unlike in GMMs, the mixture assignments are determined by an attention network instead of the Euclidean distances between the frame-level features and explicit centers; (2) instead of using a single frame as input to the attention network, contextual frames are included to smooth the attention transitions; and (3) soft-attention assignments are replaced by hard-attention assignments via the Gumbel-Softmax with straight-through estimators. With the proposed attention mechanism, we obtained a 13.7% relative improvement over vanilla mean and standard deviation pooling on the VOiCES19-eval set.
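
The pooling ideas described in the abstract can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the linear attention weights `W`, the edge-padded context stacking, the temperature `tau`, and the choice to concatenate per-component means and standard deviations are all assumptions made for illustration. Only the forward pass is shown; in training, a straight-through estimator would backpropagate through the soft probabilities, which NumPy cannot express.

```python
import numpy as np

def stats_pooling(frames):
    """Baseline: concatenate per-dimension mean and standard deviation.
    frames: (T, D) -> (2 * D,)"""
    return np.concatenate([frames.mean(axis=0), frames.std(axis=0)])

def add_context(frames, context=1):
    """Stack each frame with its +/- `context` neighbours (edge-padded).
    frames: (T, D) -> (T, (2 * context + 1) * D)"""
    T = frames.shape[0]
    padded = np.pad(frames, ((context, context), (0, 0)), mode="edge")
    return np.concatenate(
        [padded[i:i + T] for i in range(2 * context + 1)], axis=1
    )

def gumbel_hard_assign(logits, rng, tau=1.0):
    """Forward pass of Gumbel-Softmax with a hard (one-hot) output.
    logits: (T, K) -> one-hot assignments of shape (T, K)."""
    u = rng.uniform(low=1e-9, high=1.0, size=logits.shape)
    gumbel = -np.log(-np.log(u))              # Gumbel(0, 1) noise
    y = (logits + gumbel) / tau
    y = y - y.max(axis=1, keepdims=True)      # numerical stability
    soft = np.exp(y) / np.exp(y).sum(axis=1, keepdims=True)
    # Hard assignment: one-hot of the most probable component per frame.
    return np.eye(logits.shape[1])[soft.argmax(axis=1)]

def mixture_stats_pooling(frames, W, rng, context=1, tau=1.0):
    """Per-component mean and std, concatenated over K components.
    W is a hypothetical linear attention layer of shape
    ((2 * context + 1) * D, K); the result has shape (2 * K * D,)."""
    logits = add_context(frames, context) @ W          # (T, K)
    assign = gumbel_hard_assign(logits, rng, tau)      # (T, K)
    counts = np.maximum(assign.sum(axis=0), 1.0)       # avoid divide-by-zero
    means = (assign.T @ frames) / counts[:, None]      # (K, D)
    second = (assign.T @ frames ** 2) / counts[:, None]
    stds = np.sqrt(np.maximum(second - means ** 2, 0.0))
    return np.concatenate([means.ravel(), stds.ravel()])

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    frames = rng.normal(size=(200, 8))    # T=200 frames, D=8 features
    W = rng.normal(size=(3 * 8, 4))       # context=1, K=4 components
    print(stats_pooling(frames).shape)
    print(mixture_stats_pooling(frames, W, rng).shape)
```

With a single component (K = 1), every frame is assigned to the same mixture and the sketch reduces to the plain mean and standard deviation baseline, which is a convenient sanity check.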
Keywords
Speaker recognition, deep neural networks, attention models, statistics pooling, Gumbel-Softmax