Generating diverse and natural text-to-speech samples using a quantized fine-grained VAE and auto-regressive prosody prior
ICASSP, pp. 6699-6703, 2020.
Recent neural text-to-speech (TTS) models with fine-grained latent features enable precise control of the prosody of synthesized speech. Such models typically incorporate a fine-grained variational autoencoder (VAE) structure, extracting latent features at each input token (e.g., phonemes). However, generating samples with the standard ...