Real-Time Neural Text-to-Speech with Sequence-to-Sequence Acoustic Model and WaveGlow or Single Gaussian WaveRNN Vocoders

INTERSPEECH (2019)

Cited by 27 | Viewed 15
Abstract
This paper investigates real-time high-fidelity neural text-to-speech (TTS) systems. For real-time neural vocoding, WaveGlow is introduced and a single Gaussian (SG) WaveRNN is proposed. The proposed SG-WaveRNN predicts continuous-valued speech waveforms in half the synthesis time of vanilla WaveRNN with its dual softmax for 16-bit audio prediction. Additionally, a sequence-to-sequence (seq2seq) acoustic model (AM) for pitch-accent languages such as Japanese is investigated by adopting the Tacotron 2 architecture. In the seq2seq AM, full-context labels extracted by a text analyzer are used as input and are directly converted into mel-spectrograms. The results of a subjective experiment using a Japanese female corpus indicate that the proposed SG-WaveRNN vocoder with noise shaping synthesizes high-quality speech waveforms, and that real-time high-fidelity neural TTS systems can be realized with the seq2seq AM and either the WaveGlow or the SG-WaveRNN vocoder. In particular, the seq2seq AM and the WaveGlow vocoder conditioned on mel-spectrograms, using simple PyTorch implementations, achieve real-time factors of 0.06 and 0.10, respectively, for inference on a GPU.
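The single-Gaussian output described in the abstract can be pictured as replacing vanilla WaveRNN's dual softmax over 8-bit coarse/fine codes with a two-parameter Gaussian head, so each time step needs only one projection and one sampling step. The PyTorch sketch below is illustrative only, not the paper's implementation: it assumes a hidden state `h` from the autoregressive RNN, and the names `SingleGaussianOutput` and `hidden_dim` are hypothetical.

```python
import torch
import torch.nn as nn


class SingleGaussianOutput(nn.Module):
    """Illustrative single-Gaussian (SG) output head.

    Instead of two softmax layers over coarse/fine 8-bit codes, the network
    predicts the mean and log-scale of a Gaussian over the continuous sample
    value, so one linear projection and one draw suffice per time step.
    """

    def __init__(self, hidden_dim: int):
        super().__init__()
        # Two outputs per time step: mean and log standard deviation.
        self.proj = nn.Linear(hidden_dim, 2)

    def forward(self, h: torch.Tensor):
        mu, log_sigma = self.proj(h).chunk(2, dim=-1)
        return mu, log_sigma

    def sample(self, h: torch.Tensor) -> torch.Tensor:
        # Draw a continuous-valued waveform sample from N(mu, sigma^2).
        mu, log_sigma = self.forward(h)
        return mu + torch.exp(log_sigma) * torch.randn_like(mu)

    def nll(self, h: torch.Tensor, x: torch.Tensor) -> torch.Tensor:
        # Gaussian negative log-likelihood of target sample x,
        # up to the additive constant 0.5 * log(2 * pi).
        mu, log_sigma = self.forward(h)
        return (log_sigma + 0.5 * ((x - mu) * torch.exp(-log_sigma)) ** 2).mean()


if __name__ == "__main__":
    # Dummy usage: a batch of 4 RNN hidden states of size 512.
    head = SingleGaussianOutput(hidden_dim=512)
    h = torch.randn(4, 512)
    print(head.sample(h).shape)          # torch.Size([4, 1])
    print(head.nll(h, torch.zeros(4, 1)))  # scalar training loss
```

Compared with the dual-softmax formulation, this kind of head halves the per-step output computation and avoids quantization of the waveform, which is consistent with the roughly 2x synthesis-time reduction the abstract reports.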
Keywords
speech synthesis, neural vocoder, WaveGlow, WaveRNN, text-to-speech