One-Shot Voice Conversion with Speaker-Agnostic StarGAN.

Interspeech(2021)

Abstract
In this work, we propose a variant of StarGAN for many-to-many voice conversion (VC) conditioned on d-vectors extracted from short-duration (2-15 second) speech. We make several modifications to StarGAN training and employ new network architectures. We use a transformer encoder in the discriminator network, and we apply the discriminator loss to the cycle-consistency and identity samples in addition to the generated (fake) samples. Instead of classifying samples as either real or fake, our discriminator predicts the categorical speaker class, with an extra fake class added for generated samples. Furthermore, we place a gradient reversal layer after the generator's encoder and use an auxiliary classifier to remove speaker information from the encoded representation. We show that our method outperforms the baseline in both objective and subjective evaluations of voice conversion quality. Moreover, we provide an ablation study showing each component's influence on speaker similarity.
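The gradient reversal trick mentioned in the abstract can be sketched framework-agnostically. This is an illustrative toy, not the paper's implementation; the class name and the `lambd` scaling factor are assumptions:

```python
class GradientReversal:
    """Acts as the identity in the forward pass but multiplies incoming
    gradients by -lambd in the backward pass, so the encoder upstream is
    pushed to *maximize* the auxiliary speaker classifier's loss, i.e. to
    strip speaker information from its representation."""

    def __init__(self, lambd=1.0):
        self.lambd = lambd

    def forward(self, x):
        # Activations pass through unchanged.
        return x

    def backward(self, grad_output):
        # Flip (and optionally scale) the gradient flowing back to the encoder.
        return -self.lambd * grad_output


grl = GradientReversal(lambd=0.5)
print(grl.forward(3.0))   # identity in the forward direction
print(grl.backward(2.0))  # sign-flipped, scaled gradient
```

In an autograd framework this would be a custom op with these two rules; the auxiliary classifier then trains normally on its speaker-classification loss while the reversed gradient adversarially updates the encoder.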
Keywords
non-parallel voice conversion, many-to-many voice conversion, one-shot voice conversion, generative adversarial networks, StarGAN, CycleGAN, non-parallel training