Combining Speaker Recognition and Metric Learning for Speaker-Dependent Representation Learning
INTERSPEECH (2019)
Abstract
In this paper, we tackle automatic speaker verification in a text-independent setting. Speaker modelling is performed by a deep convolutional neural network operating on time-frequency speech representations. Convolutions over the time dimension allow the model to capture both short-term dependencies, since the learned filters operate over short windows, and long-term dependencies, since depth in a convolutional stack makes outputs depend on large portions of the input. Additionally, we compare various pooling strategies across the time dimension that map variable-length recordings to fixed-dimensional representations while simultaneously providing the network with an extra mechanism for modelling long-term dependencies. Finally, we propose a training scheme in which a well-known metric learning approach, triplet loss minimization, is performed alongside speaker recognition in a multi-class classification setting. Evaluation on well-known datasets and comparisons with state-of-the-art benchmarks show that the proposed setting is effective in yielding speaker-dependent representations and is thus well-suited to downstream voice biometrics tasks.
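The training scheme described above combines a metric-learning term with a classification term. A minimal sketch of such a joint objective is given below; the function names, the squared-Euclidean distance, the margin value, and the weighting factor `alpha` are illustrative assumptions, not the paper's exact formulation.

```python
import numpy as np

def triplet_loss(anchor, positive, negative, margin=0.2):
    """Triplet loss: pull same-speaker embeddings together and push
    different-speaker embeddings apart by at least `margin`.
    (Margin value is an illustrative assumption.)"""
    d_ap = np.sum((anchor - positive) ** 2)   # anchor-positive distance
    d_an = np.sum((anchor - negative) ** 2)   # anchor-negative distance
    return max(0.0, d_ap - d_an + margin)

def cross_entropy(logits, label):
    """Multi-class speaker-ID cross-entropy for a single example,
    computed in a numerically stable way."""
    shifted = logits - logits.max()
    log_probs = shifted - np.log(np.exp(shifted).sum())
    return -log_probs[label]

def joint_loss(anchor, positive, negative, logits, label, alpha=1.0):
    """Weighted sum of the metric-learning and classification terms.
    (`alpha` is a hypothetical trade-off weight.)"""
    return triplet_loss(anchor, positive, negative) \
        + alpha * cross_entropy(logits, label)
```

In practice both terms would be computed over mini-batches of embeddings produced by the convolutional network, with the classification head discarded at verification time.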
Keywords
Speaker verification, metric learning, residual convolutional neural networks, attentive features pooling
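The "attentive features pooling" listed among the keywords refers to collapsing a variable number of frame-level features into one fixed-dimensional utterance embedding via learned attention weights. A minimal single-head sketch follows; the function name and the simple dot-product scoring with a single learned vector `w` are assumptions for illustration.

```python
import numpy as np

def attentive_pooling(frames, w):
    """Attentive pooling over the time axis.

    frames: (T, D) frame-level features from the conv stack; T varies
            with recording length.
    w:      (D,) learned attention vector (hypothetical single-head form).
    Returns a fixed D-dimensional utterance embedding regardless of T.
    """
    scores = frames @ w                            # (T,) per-frame relevance
    scores = scores - scores.max()                 # numerical stability
    alpha = np.exp(scores) / np.exp(scores).sum()  # softmax attention weights
    return alpha @ frames                          # (D,) weighted average
```

With uniform scores this reduces to plain temporal average pooling; the learned weights let the network emphasize the frames most informative about speaker identity.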