Real-Time Neural Text-to-Speech with Sequence-to-Sequence Acoustic Model and WaveGlow or Single Gaussian WaveRNN Vocoders

INTERSPEECH (2019)

Cited by 27 | Viewed 15
Abstract
This paper investigates real-time high-fidelity neural text-to-speech (TTS) systems. For real-time neural vocoding, WaveGlow is introduced and a single Gaussian (SG) WaveRNN is proposed. The proposed SG-WaveRNN predicts continuous-valued speech waveforms in half the synthesis time of vanilla WaveRNN with its dual softmax for 16-bit audio prediction. Additionally, a sequence-to-sequence (seq2seq) acoustic model (AM) for pitch-accent languages such as Japanese is investigated by adopting the Tacotron 2 architecture. In the seq2seq AM, full-context labels extracted by a text analyzer are used as input and are directly converted into mel-spectrograms. The results of a subjective experiment using a Japanese female corpus indicate that the proposed SG-WaveRNN vocoder with noise shaping synthesizes high-quality speech waveforms, and that real-time high-fidelity neural TTS systems can be realized with the seq2seq AM and either the WaveGlow or the SG-WaveRNN vocoder. In particular, the seq2seq AM and the WaveGlow vocoder conditioned on mel-spectrograms, using simple PyTorch implementations, achieve real-time factors of 0.06 and 0.10, respectively, for inference on a GPU.
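The single-Gaussian output described in the abstract can be pictured as replacing vanilla WaveRNN's dual softmax over 8-bit coarse/fine codes with a two-parameter Gaussian head, so each time step needs only one projection and one sampling step. The PyTorch sketch below is illustrative only, not the paper's implementation: it assumes a hidden state `h` from the autoregressive RNN, and the names `SingleGaussianOutput` and `hidden_dim` are hypothetical.

```python
import torch
import torch.nn as nn


class SingleGaussianOutput(nn.Module):
    """Illustrative single-Gaussian (SG) output head.

    Instead of two softmax layers over coarse/fine 8-bit codes, the network
    predicts the mean and log-scale of a Gaussian over the continuous sample
    value, so one linear projection and one draw suffice per time step.
    """

    def __init__(self, hidden_dim: int):
        super().__init__()
        # Two outputs per time step: mean and log standard deviation.
        self.proj = nn.Linear(hidden_dim, 2)

    def forward(self, h: torch.Tensor):
        mu, log_sigma = self.proj(h).chunk(2, dim=-1)
        return mu, log_sigma

    def sample(self, h: torch.Tensor) -> torch.Tensor:
        # Draw a continuous-valued waveform sample from N(mu, sigma^2).
        mu, log_sigma = self.forward(h)
        return mu + torch.exp(log_sigma) * torch.randn_like(mu)

    def nll(self, h: torch.Tensor, x: torch.Tensor) -> torch.Tensor:
        # Gaussian negative log-likelihood of target sample x,
        # up to the additive constant 0.5 * log(2 * pi).
        mu, log_sigma = self.forward(h)
        return (log_sigma + 0.5 * ((x - mu) * torch.exp(-log_sigma)) ** 2).mean()


if __name__ == "__main__":
    # Dummy usage: a batch of 4 RNN hidden states of size 512.
    head = SingleGaussianOutput(hidden_dim=512)
    h = torch.randn(4, 512)
    print(head.sample(h).shape)          # torch.Size([4, 1])
    print(head.nll(h, torch.zeros(4, 1)))  # scalar training loss
```

Compared with the dual-softmax formulation, this kind of head halves the per-step output computation and avoids quantization of the waveform, which is consistent with the roughly 2x synthesis-time reduction the abstract reports.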
Keywords
speech synthesis, neural vocoder, WaveGlow, WaveRNN, text-to-speech