Fully-hierarchical fine-grained prosody modeling for interpretable speech synthesis

ICASSP, pp. 6264-6268, 2020.

Keywords:
natural TTS, F0 Frame Error, latent dimension, fine-grained VAE, interpretable speech

Abstract:

This paper proposes a hierarchical, fine-grained and interpretable latent variable model for prosody based on the Tacotron 2 text-to-speech model. It achieves multi-resolution modeling of prosody by conditioning finer level representations on coarser level ones. Additionally, it imposes hierarchical conditioning across all latent dimensions using a conditional variational auto-encoder (VAE) with an auto-regressive structure.

Introduction
  • Significant developments have taken place in neural end-to-end text-to-speech (TTS) synthesis models for generating high-fidelity speech with a simplified pipeline [1, 2, 3, 4].
  • Efforts have been made to model and control prosodic attributes by factorizing latent attributes from observed ones.
  • While most of these works use utterance-level latent representations, which capture the salient features of the utterance [6, 10, 11, 12], fine-grained prosody aligned with the phone sequence can be captured using techniques recently proposed in [13].
  • This model provides localized prosody control that achieves more variability and higher robustness to speaker perturbations.
Highlights
  • Significant developments have taken place in neural end-to-end text-to-speech (TTS) synthesis models for generating high-fidelity speech with a simplified pipeline [1, 2, 3, 4].
  • Respecting the hierarchical nature of spoken language and aiming at the interpretation of prosody at a fine scale, such as the F0 of a vowel, this paper sets out to achieve disentangled control of each prosody attribute at different levels.
  • To incorporate the dependency into inference, the hierarchical structure described in the previous section is extended to include a conditional variational auto-encoder (VAE) following an auto-regressive decomposition of the posterior (see the sketch after this list).
  • A fully-hierarchical model to achieve multi-level control of prosody attributes is proposed in this paper.
  • The model consists of a hierarchical structure across the phone, word and utterance levels.
  • A conditional VAE, which adopts a hierarchical structure across all latent dimensions, is applied at the phone and word levels.
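
Under the auto-regressive decomposition, each latent dimension's posterior is conditioned on all previously sampled dimensions in addition to the inputs, i.e. q(z | x, c) = ∏_d q(z_d | z_<d, x, c). Below is a minimal PyTorch sketch of such a posterior; the class, argument and layer names are illustrative assumptions, not the paper's implementation, and each latent dimension is scalar as in the paper's 3-dimensional latent space.

```python
import torch
import torch.nn as nn

class AutoRegressivePosterior(nn.Module):
    """Sketch of a conditional VAE posterior decomposed auto-regressively
    across latent dimensions: q(z | x, c) = prod_d q(z_d | z_{<d}, x, c),
    where x encodes the reference prosody at the current level and c is
    the coarser-level conditioning (e.g. word-level latents when
    inferring phone-level ones)."""

    def __init__(self, feat_dim: int, cond_dim: int, latent_dim: int = 3):
        super().__init__()
        self.latent_dim = latent_dim
        # One head per latent dimension; dimension d additionally sees
        # the d previously sampled scalar latents z_{<d}.
        self.heads = nn.ModuleList(
            nn.Linear(feat_dim + cond_dim + d, 2)  # -> (mu_d, log_var_d)
            for d in range(latent_dim)
        )

    def forward(self, x: torch.Tensor, c: torch.Tensor) -> torch.Tensor:
        zs = []
        for head in self.heads:
            mu, log_var = head(torch.cat([x, c] + zs, dim=-1)).chunk(2, dim=-1)
            # Reparameterization trick: z_d = mu_d + sigma_d * eps
            zs.append(mu + torch.randn_like(mu) * (0.5 * log_var).exp())
        return torch.cat(zs, dim=-1)  # shape (..., latent_dim)
```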
Methods
  • The proposed models are evaluated on the LibriTTS multi-speaker audiobook dataset [35] and the Blizzard Challenge 2013 single-speaker audiobook dataset [36].
  • LibriTTS includes approximately 585 hours of read English audiobooks at a 24 kHz sampling rate.
  • It covers a wide range of speakers, recording conditions and speaking styles.
  • The Blizzard Challenge 2013 dataset contains 147 hours of US English speech with highly varying prosody, recorded by a female professional speaker.
  • To decrease the variance due to bad alignments, the authors exclude 50 samples at each margin.
Conclusion
  • A fully-hierarchical model to achieve multi-level control of prosody attributes is proposed in this paper.
  • The model consists of a hierarchical structure across the phone, word and utterance levels.
  • A conditional VAE, which adopts a hierarchical structure across all latent dimensions, is applied at the phone and word levels.
  • Experimental results demonstrate improved interpretability through improved disentanglement, and the order in which prosody attributes are extracted is explained.
  • The difference between phone-level and word-level control effects is analyzed.
Tables
  • Table1: Reconstruction performance results. 2d and 3d refer to the dimension of the latent space. A 3-dimensional latent space is used for the conditional and the fully-hierarchical VAE. Unless otherwise specified, the posterior is conditioned on the speaker embedding.
  • Table2: MOS evaluation of speech generated with independent F0 sampling at the phone/word level. When sampling at one level, the other is set to all zeros to give neutral prosody.
  • Table3: Variance ratio for different influencing control factors associated with a vowel on LibriTTS. DIP-VAE-I refers to the model proposed in [24], which essentially enforces the covariance matrix of the marginal posterior q(z) to be diagonal.
  • Table4: Variance ratio for different influencing control factors associated with a vowel on the single-speaker audiobook dataset.
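
As a point of reference for the DIP-VAE-I baseline in Table3, the objective of [24] augments the standard VAE loss with a penalty driving the covariance of the marginal posterior toward the identity. A sketch of the regularizer, written in terms of the marginal posterior covariance as the caption describes (in [24] this covariance is computed from the posterior means, and λ_od, λ_d are its weighting hyperparameters):

```latex
\mathcal{L}_{\mathrm{DIP\text{-}VAE\text{-}I}} =
  \mathcal{L}_{\mathrm{VAE}}
  + \lambda_{od} \sum_{i \neq j} \left[ \mathrm{Cov}_{q(z)}[z] \right]_{ij}^{2}
  + \lambda_{d}  \sum_{i} \left( \left[ \mathrm{Cov}_{q(z)}[z] \right]_{ii} - 1 \right)^{2}
```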
Study subjects and analysis
samples: 50
F0 can be similarly measured using the average F0 estimated from an F0 tracker [37] among the frames in [n1, n2 − 1]. To decrease the variance due to bad alignments, we exclude 50 samples at each margin. Finally, the mel-cepstral distortion (MCD) and the F0 Frame Error (FFE) [38], which combines the Gross Pitch Error (GPE) and the Voicing Decision Error (VDE), are used to quantify the reconstruction performance.
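
For concreteness, here is a minimal sketch of how FFE can be computed from time-aligned reference and estimated F0 tracks, assuming unvoiced frames are encoded as F0 = 0 and using the conventional 20% gross-error threshold; the function name is illustrative.

```python
import numpy as np

def f0_frame_error(f0_ref: np.ndarray, f0_est: np.ndarray,
                   tol: float = 0.2) -> float:
    """F0 Frame Error (FFE) [38]: the fraction of frames exhibiting
    either a Voicing Decision Error (VDE) or a Gross Pitch Error (GPE).
    Unvoiced frames are assumed to be marked with f0 = 0."""
    voiced_ref = f0_ref > 0
    voiced_est = f0_est > 0
    # VDE: the voicing decision differs between reference and estimate.
    vde = voiced_ref != voiced_est
    # GPE: both voiced, but the estimate deviates by more than tol (20%).
    both_voiced = voiced_ref & voiced_est
    gpe = both_voiced & (np.abs(f0_est - f0_ref) > tol * f0_ref)
    # VDE and GPE frames are disjoint, so the union counts each error once.
    return float(np.mean(vde | gpe))
```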

samples: 100
To illustrate this improvement, a vowel was selected and its F0, energy and duration were measured with the method in Sec. 5. For each model, 100 samples were generated by drawing from a standard Gaussian distribution for one latent dimension while keeping the other dimensions constant. Then, the standard deviations of the measured attributes were computed.
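
A sketch of this measurement procedure follows; `synthesize` and `measure` are hypothetical stand-ins for the actual synthesis pipeline and the vowel-attribute measurement of Sec. 5.

```python
import numpy as np

def attribute_stddevs(synthesize, measure, text: str, latent_dim: int = 3,
                      n_samples: int = 100, seed: int = 0) -> list[float]:
    """For each latent dimension, draw n_samples values from a standard
    Gaussian while holding the other dimensions at zero, synthesize, and
    return the standard deviation of the measured vowel attribute."""
    rng = np.random.default_rng(seed)
    stds = []
    for d in range(latent_dim):
        values = []
        for _ in range(n_samples):
            z = np.zeros(latent_dim)
            z[d] = rng.standard_normal()   # vary only dimension d
            values.append(measure(synthesize(text, z)))
        stds.append(float(np.std(values)))
    return stds
```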

Reference
  • J. Sotelo, S. Mehri, K. Kumar, et al., “Char2wav: End-to-end speech synthesis,” in Proc. Int. Conf. on Learning Representations (ICLR), 2017.
  • Y. Wang, R. Skerry-Ryan, D. Stanton, et al., “Tacotron: Towards end-to-end speech synthesis,” in Proc. Interspeech, 2017, pp. 4006–4010.
  • W. Ping, K. Peng, A. Gibiansky, et al., “Deep Voice 3: 2000-speaker neural text-to-speech,” in Proc. Int. Conf. on Learning Representations (ICLR), 2018.
  • J. Shen, R. Pang, R. J. Weiss, et al., “Natural TTS synthesis by conditioning WaveNet on mel spectrogram predictions,” in Proc. ICASSP, 2018, pp. 4779–4783.
  • I. Sutskever, O. Vinyals, and Q. V. Le, “Sequence to sequence learning with neural networks,” in Advances in Neural Information Processing Systems, 2014.
  • W.-N. Hsu, Y. Zhang, R. J. Weiss, et al., “Hierarchical generative modeling for controllable speech synthesis,” in Proc. Int. Conf. on Learning Representations (ICLR), 2019.
  • Y. Wang, D. Stanton, Y. Zhang, et al., “Style tokens: Unsupervised style modeling, control and transfer in end-to-end speech synthesis,” in Proc. Int. Conf. on Machine Learning (ICML), 2018, pp. 5167–5176.
  • S. Ma, D. McDuff, and Y. Song, “A generative adversarial network for style modeling in a text-to-speech system,” in Proc. Int. Conf. on Learning Representations (ICLR), 2019.
  • M. Wagner and D. G. Watson, “Experimental and theoretical advances in prosody: A review,” in Language and Cognitive Processes, 2010, pp. 905–945.
  • W.-N. Hsu, Y. Zhang, R. J. Weiss, et al., “Disentangling correlated speaker and noise for speech synthesis via data augmentation and adversarial factorization,” in Proc. ICASSP, 2019.
  • E. Battenberg, S. Mariooryad, D. Stanton, et al., “Effective use of variational embedding capacity in expressive end-to-end speech synthesis,” arXiv:1906.03402, 2019.
  • R. Skerry-Ryan, E. Battenberg, Y. Xiao, et al., “Towards end-to-end prosody transfer for expressive speech synthesis with Tacotron,” in Proc. Int. Conf. on Machine Learning (ICML), 2018.
  • Y. Lee and T. Kim, “Robust and fine-grained prosody control of end-to-end speech synthesis,” in Proc. ICASSP, 2019.
  • J. Shen, R. Pang, R. J. Weiss, et al., “Natural TTS synthesis by conditioning WaveNet on mel spectrogram predictions,” in Proc. ICASSP, 2018.
  • Y.-J. Zhang, S. Pan, L. He, and Z.-H. Ling, “Learning latent representations for style control and transfer in end-to-end speech synthesis,” in Proc. ICASSP, 2019, pp. 6945–6949.
  • G. E. Henter, J. Lorenzo-Trueba, X. Wang, and J. Yamagishi, “Deep encoder-decoder models for unsupervised learning of controllable speech synthesis,” arXiv:1807.11470, 2018.
  • K. Akuzawa, Y. Iwasawa, and Y. Matsuo, “Expressive speech synthesis via modeling expressions with variational autoencoder,” in Proc. Interspeech, 2018, pp. 3067–3071.
  • A. Razavi, A. van den Oord, and O. Vinyals, “Generating diverse high-fidelity images with VQ-VAE-2,” arXiv:1904.02882, 2019.
  • Y.-A. Chung, Y. Wang, W.-N. Hsu, Y. Zhang, and R. Skerry-Ryan, “Semi-supervised training for improving data efficiency in end-to-end speech synthesis,” in Proc. ICASSP, 2019.
  • W.-N. Hsu, H. Tang, and J. Glass, “Unsupervised adaptation with interpretable disentangled representations for distant conversational speech recognition,” in Proc. Interspeech, 2018.
  • W.-N. Hsu, Y. Zhang, and J. Glass, “Unsupervised learning of disentangled and interpretable representations from sequential data,” in Advances in Neural Information Processing Systems, 2017.
  • I. Higgins, L. Matthey, A. Pal, et al., “Beta-VAE: Learning basic visual concepts with a constrained variational framework,” in Proc. Int. Conf. on Learning Representations (ICLR), 2017.
  • H. Kim and A. Mnih, “Disentangling by factorising,” in Proc. Int. Conf. on Machine Learning (ICML), 2018, pp. 2649–2658.
  • A. Kumar, P. Sattigeri, and A. Balakrishnan, “Variational inference of disentangled latent concepts from unlabeled observations,” in Proc. Int. Conf. on Learning Representations (ICLR), 2017.
  • F. Locatello, S. Bauer, M. Lucic, et al., “Challenging common assumptions in the unsupervised learning of disentangled representations,” in Proc. Int. Conf. on Machine Learning (ICML), 2019.
  • S. Narayanaswamy, T. B. Paige, J.-W. van de Meent, et al., “Learning disentangled representations with semi-supervised deep generative models,” in Advances in Neural Information Processing Systems, 2017, pp. 5925–5935.
  • P. K. Gyawali, Z. Li, S. Ghimire, and L. Wang, “Semi-supervised learning by disentangling and self-ensembling over stochastic latent space,” arXiv:1907.09607, 2019.
  • R. Habib, S. Mariooryad, M. Shannon, et al., “Semi-supervised generative modeling for controllable speech synthesis,” arXiv:1910.01709, 2019.
  • X. Chen, Y. Duan, R. Houthooft, et al., “InfoGAN: Interpretable representation learning by information maximizing generative adversarial nets,” in Advances in Neural Information Processing Systems, 2016, pp. 2172–2180.
  • M. Mathieu, J. Zhao, P. Sprechmann, A. Ramesh, and Y. LeCun, “Disentangling factors of variation in deep representations using adversarial training,” in Advances in Neural Information Processing Systems, 2016.
  • B. Esmaeili, H. Wu, S. Jain, et al., “Structured disentangled representations,” in Proc. Int. Conf. on Artificial Intelligence and Statistics, 2019, pp. 2525–2534.
  • A. Vaswani, N. Shazeer, N. Parmar, et al., “Attention is all you need,” in Advances in Neural Information Processing Systems, 2017.
  • B. Uria, I. Murray, and H. Larochelle, “RNADE: The real-valued neural autoregressive density-estimator,” in Advances in Neural Information Processing Systems, 2013.
  • G. Papamakarios, T. Pavlakou, and I. Murray, “Masked autoregressive flow for density estimation,” in Advances in Neural Information Processing Systems, 2017, pp. 2338–2347.
  • H. Zen, V. Dang, R. Clark, et al., “LibriTTS: A corpus derived from LibriSpeech for text-to-speech,” in Proc. Interspeech, 2019.
  • S. King and V. Karaiskos, “The Blizzard Challenge 2013,” in Blizzard Challenge Workshop, 2013.
  • A. de Cheveigné and H. Kawahara, “YIN, a fundamental frequency estimator for speech and music,” Journal of the Acoustical Society of America, vol. 111, no. 4, pp. 1917–1930, 2002.
  • W. Chu and A. Alwan, “Reducing F0 frame error of F0 tracking algorithms under noisy conditions with an unvoiced/voiced classification frontend,” in Proc. ICASSP, 2009.
  • “Audio samples from “Fully-hierarchical Fine-grained Prosody Modelling for Interpretable Speech Synthesis”,” https://google.github.io/tacotron/publications/hierarchical_prosody.