Parallel Tacotron: Non-Autoregressive and Controllable TTS

Isaac Elias
Jonathan Shen
Ye Jia
Keywords:
neural end-to-end, Neural TTS, Gaussian mixture model, gated linear unit, modern parallel hardware, …

Abstract:

Although neural end-to-end text-to-speech models can synthesize highly natural speech, there is still room for improvement in their efficiency and naturalness. This paper proposes a non-autoregressive neural text-to-speech model augmented with a variational autoencoder-based residual encoder. This model, called Parallel Tacotron, ...

Introduction
  • Neural end-to-end text-to-speech (TTS) has been researched extensively in the last few years [1,2,3,4].
  • Tacotron 2 uses an autoregressive uni-directional long short-term memory (LSTM)-based decoder with the soft attention mechanism [7].
  • This architecture makes both training and inference less efficient on modern parallel hardware than fully feed-forward architectures.
  • Although Transformer TTS [4] addresses the inefficiency during training, it is still inefficient during inference and remains prone to robustness errors due to its autoregressive decoder and attention mechanism.
Highlights
  • Neural end-to-end text-to-speech (TTS) has been researched extensively in the last few years [1,2,3,4]
  • This paper presents a non-autoregressive neural TTS model augmented by a variational autoencoder (VAE)
  • The model, called Parallel Tacotron, has the following properties, which we found helpful for synthesizing highly natural speech efficiently: (1) a non-autoregressive architecture based on self-attention with lightweight convolutions [31], (2) an iterative mel-spectrogram loss [32], and (3) a VAE-style residual encoder [28, 30]; a toy sketch of the iterative loss appears after this list
  • Parallel Tacotron without VAE was significantly worse than the baseline Tacotron 2 in both Mean Opinion Score (MOS) and preference
  • A non-autoregressive neural TTS model called Parallel Tacotron was proposed. It matched the baseline Tacotron 2 in naturalness and offered significantly faster inference. We showed that both variational residual encoders and an iterative loss improved naturalness, and that using lightweight convolutions instead of Transformer-style self-attention improved both naturalness and efficiency
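As a rough illustration of the iterative mel-spectrogram loss referenced above, here is a minimal numpy sketch; the function name, the choice of an L1 distance, and the plain averaging over decoder blocks are assumptions for illustration rather than the paper's exact formulation.

```python
import numpy as np

def iterative_mel_loss(block_predictions, target_mel):
    """Apply a reconstruction loss to the mel-spectrogram predicted by
    every decoder block (not only the final one) and average the terms,
    so that each block receives a direct training signal."""
    losses = [np.mean(np.abs(pred - target_mel)) for pred in block_predictions]
    return sum(losses) / len(losses)

# Toy usage: 4 decoder blocks, 100 frames, 80 mel bins.
target = np.random.randn(100, 80)
predictions = [target + 0.1 * np.random.randn(100, 80) for _ in range(4)]
print(iterative_mel_loss(predictions, target))
```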
Methods
  • The authors used a proprietary speech dataset containing 405 hours of speech data: 347,872 utterances from 45 speakers in 3 English accents (32 US English, 8 British English, and 5 Australian English speakers).
  • The Parallel Tacotron models were trained with Nesterov momentum optimization (momentum = 0.99).
  • All models were trained for 120K steps with global gradient norm clipping of 0.2 and a batch size of 2,048 using Google Cloud TPUs; training took less than one day (the clipping step is sketched below).
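The training setup above mentions global gradient norm clipping at 0.2; the following is a small numpy sketch of that operation (the helper name and the epsilon guard are our own, not from the paper).

```python
import numpy as np

def clip_by_global_norm(grads, max_norm=0.2):
    """Rescale a list of gradient arrays so that their joint L2 norm
    does not exceed max_norm (0.2 in the setup described above)."""
    global_norm = np.sqrt(sum(np.sum(g ** 2) for g in grads))
    scale = min(1.0, max_norm / (global_norm + 1e-12))
    return [g * scale for g in grads]

# Toy usage with two parameter gradients.
grads = [np.ones((3, 3)), 2.0 * np.ones(5)]
clipped = clip_by_global_norm(grads)
print(np.sqrt(sum(np.sum(g ** 2) for g in clipped)))  # <= 0.2
```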
Results
  • The first experiment evaluated the effect of the iterative loss.
  • Tables 1 and 2 show the experimental results.
  • Although there was no significant difference in MOS or preference against Tacotron 2, the direct comparison between models with and without the iterative loss indicates that it can give a small improvement.
  • The second experiment evaluated the impact of VAEs. Tables 5 and 6 show the experimental results.
  • The introduction of a global VAE made Parallel Tacotron comparable to the baseline Tacotron 2 in both evaluations.
  • The introduction of a fine-grained phoneme-level VAE further boosted the naturalness.
Conclusion
  • A non-autoregressive neural TTS model called Parallel Tacotron was proposed. It matched the baseline Tacotron 2 in naturalness and offered significantly faster inference.
  • The authors showed that both variational residual encoders and an iterative loss improved naturalness, and that using lightweight convolutions instead of Transformer-style self-attention improved both naturalness and efficiency.
Tables
  • Table1: Subjective evaluations of Parallel Tacotron with and without the iterative loss. Positive preference scores indicate that the corresponding Parallel Tacotron model was rated better than the reference Tacotron 2
  • Table2: Subjective preference scores between Parallel Tacotron with and without the iterative loss. Positive preference scores indicate that the corresponding model with the iterative loss was rated better than the one without the iterative loss
  • Table3: Subjective evaluations of Parallel Tacotron with different self-attention (with Global VAE and iterative loss). Positive preference scores indicate that the corresponding Parallel Tacotron was rated better than Tacotron 2
  • Table4: Subjective preference score between Parallel Tacotron using LConv and Transformer-based self-attention. Positive preference scores indicate that LConv was rated better than Transformer
  • Table5: Subjective evaluations of Parallel Tacotron with different VAEs. Positive preference scores indicate that the corresponding Parallel Tacotron was rated better than Tacotron 2
  • Table6: Subjective preference scores between Parallel Tacotron using the global and fine-grained VAEs. Positive preference scores indicate that the left-hand models were rated better than the right-hand ones
  • Table7: Subjective preference scores between synthetic and natural speech. Positive preference scores indicate that the synthetic speech was rated better than the natural speech
  • Table8: Inference speed to predict mel-spectrograms for a ∼20-second-long utterance on a TPU (aggregated over ten trials)
Study subjects and analysis
speakers: 45
3.1. Training Setup

We used a proprietary speech dataset containing 405 hours of speech data: 347,872 utterances from 45 speakers in 3 English accents (32 US English, 8 British English, and 5 Australian English speakers). The Parallel Tacotron models were trained with Nesterov momentum optimization (momentum = 0.99).

US English speakers: 10
These sentences were different from the training data and were used in previous papers. They were synthesized using 10 US English speakers (5 male & 5 female) in a round-robin style (100 sentences per speaker), as sketched below. The naturalness of the synthesized speech was evaluated through subjective listening tests, including 5-scale Mean Opinion Score (MOS) tests and side-by-side preference tests.
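A toy sketch of the round-robin assignment of evaluation sentences to speakers described above (100 sentences per speaker across 10 speakers); the helper name and the example data are ours, purely for illustration.

```python
def round_robin_assign(sentences, speakers):
    """Cycle through the speakers so that consecutive sentences go to
    different speakers and each speaker gets an equal share."""
    return [(sentence, speakers[i % len(speakers)])
            for i, sentence in enumerate(sentences)]

# Toy usage: 1,000 sentences over 10 speakers -> 100 sentences each.
sentences = ["sentence %d" % i for i in range(1000)]
speakers = ["speaker %d" % j for j in range(10)]
assignments = round_robin_assign(sentences, speakers)
print(sum(1 for _, spk in assignments if spk == "speaker 0"))  # 100
```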

Reference
  • [1] J. Sotelo, S. Mehri, K. Kumar, J. F. Santos, K. Kastner, A. C. Courville, and Y. Bengio, “Char2Wav: End-to-End Speech Synthesis,” in Proc. ICLR, 2017.
  • [2] Y. Wang, RJ Skerry-Ryan, D. Stanton, Y. Wu, R. J. Weiss, N. Jaitly, Z. Yang, Y. Xiao, Z. Chen, S. Bengio, Q. Le, Y. Agiomyrgiannakis, R. Clark, and R. A. Saurous, “Tacotron: Towards End-to-End Speech Synthesis,” in Proc. Interspeech, 2017, pp. 4006–4010.
  • [3] W. Ping, K. Peng, A. Gibiansky, S. O. Arik, A. Kannan, S. Narang, J. Raiman, and J. Miller, “Deep Voice 3: 2000-Speaker Neural Text-to-Speech,” in Proc. ICLR, 2018.
  • [4] N. Li, S. Liu, Y. Liu, S. Zhao, and M. Liu, “Neural Speech Synthesis with Transformer Network,” in Proc. AAAI, 2019, vol. 33, pp. 6706–6713.
  • [5] A. van den Oord, S. Dieleman, H. Zen, K. Simonyan, O. Vinyals, A. Graves, N. Kalchbrenner, A. Senior, and K. Kavukcuoglu, “WaveNet: A Generative Model for Raw Audio,” arXiv:1609.03499, 2016.
  • [6] J. Shen, R. Pang, R. J. Weiss, M. Schuster, N. Jaitly, Z. Yang, Z. Chen, Y. Zhang, Y. Wang, RJ Skerry-Ryan, R. A. Saurous, Y. Agiomyrgiannakis, and Y. Wu, “Natural TTS Synthesis by Conditioning WaveNet on Mel Spectrogram Predictions,” in Proc. ICASSP, 2018.
  • [7] D. Bahdanau, K. Cho, and Y. Bengio, “Neural Machine Translation by Jointly Learning to Align and Translate,” in Proc. ICLR, 2015.
  • [8] R. J. Williams and D. Zipser, “A Learning Algorithm for Continually Running Fully Recurrent Neural Networks,” Neural Computation, vol. 1, no. 2, pp. 270–280, 1989.
  • [9] S. Bengio, O. Vinyals, N. Jaitly, and N. Shazeer, “Scheduled Sampling for Sequence Prediction with Recurrent Neural Networks,” in Proc. NIPS, 2015, pp. 1171–1179.
  • [10] A. Goyal, A. Lamb, Y. Zhang, S. Zhang, A. Courville, and Y. Bengio, “Professor Forcing: A New Algorithm for Training Recurrent Networks,” in Proc. NIPS, 2016, pp. 4601–4609.
  • [11] M. He, Y. Deng, and L. He, “Robust Sequence-to-Sequence Acoustic Modeling with Stepwise Monotonic Attention for Neural TTS,” in Proc. Interspeech, 2019, pp. 1293–1297.
  • [12] Y. Zheng, J. Tao, Z. Wen, and J. Yi, “Forward-Backward Decoding Sequence for Regularizing End-to-End TTS,” IEEE/ACM Trans. Audio Speech & Lang. Process., vol. 27, no. 12, pp. 2067–2079, 2019.
  • [13] H. Guo, F. K. Soong, L. He, and L. Xie, “A New GAN-Based End-to-End TTS Training Algorithm,” in Proc. Interspeech, 2019, pp. 1288–1292.
  • [14] E. Battenberg, RJ Skerry-Ryan, S. Mariooryad, D. Stanton, D. Kao, M. Shannon, and T. Bagby, “Location-Relative Attention Mechanisms for Robust Long-Form Speech Synthesis,” in Proc. ICASSP, 2020, pp. 6194–6198.
  • [15] Y. Ren, Y. Ruan, X. Tan, T. Qin, S. Zhao, Z. Zhao, and T.-Y. Liu, “FastSpeech: Fast, Robust and Controllable Text to Speech,” arXiv:1905.09263, 2019.
  • [16] Y. Ren, C. Hu, X. Tan, T. Qin, S. Zhao, Z. Zhao, and T.-Y. Liu, “FastSpeech 2: Fast and High-Quality End-to-End Text to Speech,” arXiv:2006.04558, 2020.
  • [17] S. Beliaev, Y. Rebryk, and B. Ginsburg, “TalkNet: Fully-Convolutional Non-Autoregressive Speech Synthesis Model,” arXiv:2005.05514, 2020.
  • [18] D. Lim, W. Jang, H. Park, B. Kim, and J. Yoon, “JDI-T: Jointly Trained Duration Informed Transformer for Text-to-Speech without Explicit Alignment,” arXiv:2005.07799, 2020.
  • [19] Z. Zeng, J. Wang, N. Cheng, T. Xia, and J. Xiao, “AlignTTS: Efficient Feed-Forward Text-to-Speech System without Explicit Alignment,” in Proc. ICASSP, 2020, pp. 6714–6718.
  • [20] C. Miao, S. Liang, M. Chen, J. Ma, S. Wang, and J. Xiao, “Flow-TTS: A Non-Autoregressive Network for Text to Speech Based on Flow,” in Proc. ICASSP, 2020, pp. 7209–7213.
  • [21] J. Donahue, S. Dieleman, M. Bińkowski, E. Elsen, and K. Simonyan, “End-to-End Adversarial Text-to-Speech,” arXiv:2006.03575, 2020.
  • [22] A. Łańcucki, “FastPitch: Parallel Text-to-Speech with Pitch Prediction,” arXiv:2006.06873, 2020.
  • [23] C. Yu, H. Lu, N. Hu, M. Yu, C. Weng, K. Xu, P. Liu, D. Tuo, K. Kang, G. Lei, D. Su, and D. Yu, “DurIAN: Duration Informed Attention Network for Multimodal Synthesis,” arXiv:1909.01700, 2019.
  • [24] J. Shen, Y. Jia, M. Chrzanowski, Y. Zhang, I. Elias, H. Zen, and Y. Wu, “Non-Attentive Tacotron: Robust and Controllable Neural TTS Synthesis Including Unsupervised Duration Modeling,” arXiv:2010.04301, 2020.
  • [25] H. Zen, K. Tokuda, and A. Black, “Statistical Parametric Speech Synthesis,” Speech Communication, vol. 51, no. 11, pp. 1039–1064, 2009.
  • [26] H. Zen, A. Senior, and M. Schuster, “Statistical Parametric Speech Synthesis Using Deep Neural Networks,” in Proc. ICASSP, 2013, pp. 7962–7966.
  • [27] Y. Wang, D. Stanton, Y. Zhang, RJ Skerry-Ryan, E. Battenberg, J. Shor, Y. Xiao, Y. Jia, F. Ren, and R. A. Saurous, “Style Tokens: Unsupervised Style Modeling, Control and Transfer in End-to-End Speech Synthesis,” in Proc. ICML, 2018, pp. 5167–5176.
  • [28] W.-N. Hsu, Y. Zhang, R. J. Weiss, H. Zen, Y. Wu, Y. Wang, Y. Cao, Y. Jia, Z. Chen, J. Shen, P. Nguyen, and R. Pang, “Hierarchical Generative Modeling for Controllable Speech Synthesis,” in Proc. ICLR, 2019.
  • [29] Y. Zhang, R. J. Weiss, H. Zen, Y. Wu, Z. Chen, RJ Skerry-Ryan, Y. Jia, A. Rosenberg, and B. Ramabhadran, “Learning to Speak Fluently in a Foreign Language: Multilingual Speech Synthesis and Cross-Language Voice Cloning,” in Proc. Interspeech, 2019, pp. 2080–2084.
  • [30] G. Sun, Y. Zhang, R. J. Weiss, Y. Cao, H. Zen, and Y. Wu, “Fully-Hierarchical Fine-Grained Prosody Modeling for Interpretable Speech Synthesis,” arXiv:2002.03785, 2020.
  • [31] F. Wu, A. Fan, A. Baevski, Y. N. Dauphin, and M. Auli, “Pay Less Attention with Lightweight and Dynamic Convolutions,” in Proc. ICLR, 2019.
  • [32] A. Tjandra, C. Liu, F. Zhang, X. Zhang, Y. Wang, G. Synnaeve, S. Nakamura, and G. Zweig, “DEJA-VU: Double Feature Presentation and Iterated Loss in Deep Transformer Networks,” in Proc. ICASSP, 2020, pp. 6899–6903.
  • [33] J. Gu, J. Bradbury, C. Xiong, V. O. K. Li, and R. Socher, “Non-Autoregressive Neural Machine Translation,” arXiv:1711.02281, 2017.
  • [34] J. Lee, E. Mansimov, and K. Cho, “Deterministic Non-Autoregressive Neural Sequence Modeling by Iterative Refinement,” arXiv:1802.06901, 2018.
  • [35] R. Shu, J. Lee, H. Nakayama, and K. Cho, “Latent-Variable Non-Autoregressive Neural Machine Translation with Deterministic Inference Using a Delta Posterior,” in Proc. AAAI, 2020.
  • [36] J. L. Ba, J. R. Kiros, and G. E. Hinton, “Layer Normalization,” arXiv:1607.06450, 2016.
  • [37] D. Talkin and C. W. Wightman, “The Aligner: Text to Speech Alignment Using Markov Models and a Pronunciation Dictionary,” in ESCA/IEEE SSW2, 1994.
  • [38] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, L. Kaiser, and I. Polosukhin, “Attention Is All You Need,” in Proc. NeurIPS, 2017.
  • [39] R. Liu, J. Lehman, P. Molino, F. P. Such, E. Frank, A. Sergeev, and J. Yosinski, “An Intriguing Failing of Convolutional Neural Networks and the CoordConv Solution,” in Proc. NeurIPS, 2018, pp. 9605–9616.
  • [40] A. Graves, “Generating Sequences with Recurrent Neural Networks,” arXiv:1308.0850, 2013.
  • [41] RJ Skerry-Ryan, E. Battenberg, Y. Xiao, Y. Wang, D. Stanton, J. Shor, R. Weiss, R. Clark, and R. A. Saurous, “Towards End-to-End Prosody Transfer for Expressive Speech Synthesis with Tacotron,” in Proc. ICML, 2018.
  • [42] N. Kalchbrenner, E. Elsen, K. Simonyan, S. Noury, N. Casagrande, E. Lockhart, F. Stimberg, A. van den Oord, S. Dieleman, and K. Kavukcuoglu, “Efficient Neural Audio Synthesis,” in Proc. ICML, 2018, pp. 2410–2419.