Generating diverse and natural text-to-speech samples using a quantized fine-grained VAE and auto-regressive prosody prior

Sun Guangzhi
Rosenberg Andrew

ICASSP, pp. 6699-6703, 2020.


Abstract:

Recent neural text-to-speech (TTS) models with fine-grained latent features enable precise control of the prosody of synthesized speech. Such models typically incorporate a fine-grained variational autoencoder (VAE) structure, extracting latent features at each input token (e.g., phonemes). However, generating samples with the standard ...

