SC-CNN: Effective Speaker Conditioning Method for Zero-Shot Multi-Speaker Text-to-Speech Systems

IEEE Signal Process. Lett. (2023)

Abstract
This letter proposes an effective speaker-conditioning method applicable to zero-shot multi-speaker text-to-speech (ZSM-TTS) systems. Based on the inductive bias in the speech generation task, in which local context information in text/phoneme sequences heavily affects the speaker characteristics of the output speech, we propose a Speaker-Conditional Convolutional Neural Network (SC-CNN) for the ZSM-TTS task. SC-CNN first predicts convolutional kernels from each learned speaker embedding, then applies 1-D convolutions to phoneme sequences with the predicted kernels. It exploits the aforementioned inductive bias and effectively models the characteristics of speech by providing speaker-specific local context in the phonetic domain. We also build both FastSpeech2-based and VITS-based ZSM-TTS systems to verify its superiority over conventional speaker-conditioning methods. The results confirm that the models with SC-CNN outperform recent ZSM-TTS models in terms of both subjective and objective measurements.
Keywords
Generalization, text-to-speech, zero-shot, multi-speaker, style transfer
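
The two steps the abstract describes (predict convolutional kernels from a speaker embedding, then apply a 1-D convolution to the phoneme sequence with those kernels) can be illustrated with a minimal NumPy sketch. This is not the authors' implementation. It assumes depthwise kernels (one per channel), a single linear layer as the kernel predictor, and odd kernel sizes with zero padding. All names (`predict_kernels`, `speaker_conditional_conv1d`, `W_k`, `b_k`) and all dimensions are hypothetical.

```python
import numpy as np


def predict_kernels(spk_emb, W_k, b_k, channels, kernel_size):
    """Map a speaker embedding to per-channel 1-D kernels.

    Assumption: a single linear projection stands in for the kernel predictor.
    spk_emb: (D,), W_k: (D, channels * kernel_size), b_k: (channels * kernel_size,)
    """
    flat = spk_emb @ W_k + b_k
    return flat.reshape(channels, kernel_size)


def speaker_conditional_conv1d(phonemes, kernels):
    """Depthwise 1-D convolution along time with zero 'same' padding.

    phonemes: (T, C) hidden phoneme sequence; kernels: (C, K) with odd K.
    Uses the cross-correlation convention common in deep learning libraries.
    """
    T, _ = phonemes.shape
    K = kernels.shape[1]
    pad = K // 2
    padded = np.pad(phonemes, ((pad, pad), (0, 0)))
    out = np.zeros_like(phonemes)
    for k in range(K):
        # Each tap mixes a time-shifted copy of the sequence, weighted per channel.
        out += padded[k:k + T] * kernels[:, k]
    return out


# Toy dimensions, chosen only for illustration.
rng = np.random.default_rng(0)
T, C, K, D = 6, 4, 3, 8
phonemes = rng.standard_normal((T, C))
W_k = rng.standard_normal((D, C * K)) * 0.1
b_k = np.zeros(C * K)

spk_a = rng.standard_normal(D)
spk_b = rng.standard_normal(D)

# The same phoneme sequence is filtered differently for each speaker.
out_a = speaker_conditional_conv1d(phonemes, predict_kernels(spk_a, W_k, b_k, C, K))
out_b = speaker_conditional_conv1d(phonemes, predict_kernels(spk_b, W_k, b_k, C, K))
print(out_a.shape, np.allclose(out_a, out_b))
```

In this sketch the speaker embedding never touches the phoneme features directly. It only determines the filter that mixes each phoneme with its neighbours, which is one way to read "speaker-specific local context in the phonetic domain".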