VoiceCraft: Zero-Shot Speech Editing and Text-to-Speech in the Wild
arXiv (2024)
Abstract
We introduce VoiceCraft, a token-infilling neural codec language model that
achieves state-of-the-art performance on both speech editing and zero-shot
text-to-speech (TTS) on audiobooks, internet videos, and podcasts. VoiceCraft
employs a Transformer decoder architecture and introduces a token rearrangement
procedure that combines causal masking and delayed stacking to enable
generation within an existing sequence. On speech editing tasks, VoiceCraft
produces edited speech that is nearly indistinguishable from unedited
recordings in terms of naturalness, as evaluated by humans; for zero-shot TTS,
our model outperforms prior SotA models including VALL-E and the popular
commercial model XTTS-v2. Crucially, the models are evaluated on challenging
and realistic datasets that consist of diverse accents, speaking styles,
recording conditions, and background noise and music, and our model performs
consistently well compared to other models and real recordings. In particular,
for speech editing evaluation, we introduce a high-quality, challenging, and
realistic dataset named RealEdit. We encourage readers to listen to the demos
at https://jasonppy.github.io/VoiceCraft_web.
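
The token rearrangement named in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: the function names, the `<M0>`-style mask tokens, and the `PAD` value are hypothetical, and the sketch assumes that causal masking moves each span to be generated to the end of the sequence (leaving a mask token in its place) and that delayed stacking shifts the k-th codebook right by k steps.

```python
from typing import List, Sequence, Tuple, Union

PAD = -1  # hypothetical id for empty positions created by the delay


def causal_mask_rearrange(
    frames: Sequence[int], spans: Sequence[Tuple[int, int]]
) -> List[Union[int, str]]:
    """Move each span [start, end) to the end of the sequence.

    A mask token marks where the span was removed, and the same token
    introduces the span at the end, so a left-to-right decoder sees both
    the left and the right context before generating the span.
    """
    kept: List[Union[int, str]] = []
    moved: List[Union[int, str]] = []
    cursor = 0
    for i, (start, end) in enumerate(sorted(spans)):
        mask_token = f"<M{i}>"  # hypothetical mask token naming
        kept.extend(frames[cursor:start])
        kept.append(mask_token)
        moved.append(mask_token)
        moved.extend(frames[start:end])
        cursor = end
    kept.extend(frames[cursor:])
    return kept + moved


def delay_stack(codes: Sequence[Sequence[int]], pad: int = PAD) -> List[List[int]]:
    """Shift codebook k right by k steps, padding so all rows share one length.

    codes[k][t] is the token of codebook k at frame t.
    """
    num_codebooks = len(codes)
    return [
        [pad] * k + list(row) + [pad] * (num_codebooks - 1 - k)
        for k, row in enumerate(codes)
    ]


rearranged = causal_mask_rearrange([0, 1, 2, 3, 4, 5], [(2, 4)])
stacked = delay_stack([[1, 2, 3], [4, 5, 6]])
```

Here frames 2 and 3 are the span to regenerate, so they end up after the surrounding context, and the second codebook lags the first by one step. How the two steps interact on real multi-codebook sequences is specified in the paper and is not reproduced here.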