Speaker Identification for Household Scenarios with Self-Attention and Adversarial Training.

INTERSPEECH (2020)

Abstract
Speaker identification based on voice input is a fundamental capability in speech processing that enables versatile downstream applications, such as personalization and authentication. With the advent of deep learning, most state-of-the-art methods derive acoustic embeddings from utterances with convolutional neural networks (CNNs) and recurrent neural networks (RNNs). This paper addresses two inherent limitations of current approaches. First, voice characteristics over long time spans might not be fully captured by CNNs and RNNs, as they are designed to focus on local feature extraction and on modeling adjacent dependencies, respectively. Second, complex deep learning models can be fragile with regard to subtle but intentional changes in model inputs, also known as adversarial perturbations. To distill informative global acoustic embedding representations from utterances and to remain robust to adversarial perturbations, we propose a Self-Attentive Adversarial Speaker-Identification method (SAASI). In experiments on the VCTK dataset, SAASI significantly outperforms four state-of-the-art baselines in identifying both known and new speakers.
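The two ideas named in the abstract can be illustrated with a minimal sketch. It is not the SAASI architecture, whose details the abstract does not give. It assumes plain scaled dot-product self-attention with identity projections (no learned weights), mean pooling into an utterance embedding, and an FGSM-style sign-gradient perturbation as a stand-in for "adversarial perturbation". The function names `self_attention`, `utterance_embedding`, and `fgsm_perturb` are hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def self_attention(frames):
    """Scaled dot-product self-attention over all frames.

    Assumption: queries, keys, and values are the raw frames themselves
    (identity projections). Every output frame is a weighted mix of ALL
    input frames, so dependencies are not limited to adjacent positions.
    """
    d = len(frames[0])
    attended = []
    for q in frames:
        weights = softmax([dot(q, k) / math.sqrt(d) for k in frames])
        attended.append(
            [sum(w * v[j] for w, v in zip(weights, frames)) for j in range(d)]
        )
    return attended


def utterance_embedding(frames):
    """Mean-pool the attended frames into one fixed-size vector (assumed pooling)."""
    attended = self_attention(frames)
    n = len(attended)
    return [sum(f[j] for f in attended) / n for j in range(len(attended[0]))]


def fgsm_perturb(x, grad, epsilon):
    """FGSM-style perturbation: x + epsilon * sign(grad).

    `grad` is the gradient of a loss with respect to x; how it is computed
    and what SAASI actually perturbs are not stated in the abstract.
    """
    def sign(g):
        return (g > 0) - (g < 0)

    return [xi + epsilon * sign(gi) for xi, gi in zip(x, grad)]
```

For example, `utterance_embedding([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])` returns a single two-dimensional vector regardless of how many frames the utterance has, and `fgsm_perturb(x, grad, 0.1)` shifts each coordinate of `x` by at most 0.1 in the direction that increases the loss.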
Keywords
Self-attention, adversarial training, speaker identification in households