Parallel Tacotron: Non-Autoregressive and Controllable TTS
Abstract:
Although neural end-to-end text-to-speech models can synthesize highly natural speech, there is still room for improvement in their efficiency and naturalness. This paper proposes a non-autoregressive neural text-to-speech model augmented with a variational autoencoder-based residual encoder. This model, called Parallel Tacotron, ...
Introduction
- Neural end-to-end text-to-speech (TTS) has been researched extensively in the last few years [1,2,3,4].
- Tacotron 2 uses an autoregressive uni-directional long short-term memory (LSTM)-based decoder with the soft attention mechanism [7].
- This architecture makes both training and inference less efficient on modern parallel hardware than fully feed-forward architectures.
- Although Transformer TTS [4] addresses the inefficiency during training, it is still inefficient during inference and has the potential for robustness errors due to the autoregressive decoder and the attention mechanism
Highlights
- Neural end-to-end text-to-speech (TTS) has been researched extensively in the last few years [1,2,3,4]
- This paper presents a non-autoregressive neural TTS model augmented by a variational autoencoder (VAE)
- The model, called Parallel Tacotron, has the following properties, which we found helpful for synthesizing highly natural speech efficiently: (1) a non-autoregressive architecture based on self-attention with lightweight convolutions [31], (2) an iterative mel-spectrogram loss [32], and (3) a VAE-style residual encoder [28, 30] (an illustrative lightweight-convolution sketch follows this list)
- Parallel Tacotron without VAE was significantly worse than the baseline Tacotron 2 in both Mean Opinion Score (MOS) and preference
- A non-autoregressive neural TTS model called Parallel Tacotron was proposed. It matched the baseline Tacotron 2 in naturalness and offered significantly faster inference than Tacotron 2. We showed that both variational residual encoders and an iterative loss improved the naturalness, and the use of lightweight convolutions as self-attention improved both naturalness and efficiency
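Property (1) above refers to the lightweight convolutions (LConv) of Wu et al. [31]. As a rough illustration only, the sketch below (written in PyTorch, which is an assumption; the paper does not publish its implementation, and the sizes here are illustrative) shows the core operation: a depthwise 1-D convolution whose kernel is shared across groups of channels and softmax-normalized along its taps, applied to every time step in parallel.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class LightweightConv1d(nn.Module):
    """Depthwise 1-D convolution with a softmax-normalized kernel shared
    across groups of channels ("heads"), after Wu et al. [31].
    Channel count, kernel size, and head count are illustrative, not the
    paper's configuration."""
    def __init__(self, channels, kernel_size=17, num_heads=8):
        super().__init__()
        assert channels % num_heads == 0
        self.num_heads = num_heads
        self.padding = kernel_size // 2
        self.weight = nn.Parameter(torch.randn(num_heads, 1, kernel_size))

    def forward(self, x):                      # x: (batch, channels, time)
        b, c, t = x.shape
        w = F.softmax(self.weight, dim=-1)     # normalize the kernel taps
        x = x.reshape(-1, self.num_heads, t)   # one shared kernel per head
        y = F.conv1d(x, w, padding=self.padding, groups=self.num_heads)
        return y.reshape(b, c, t)

# All time steps are processed in one pass, unlike an autoregressive decoder.
frames = torch.randn(2, 128, 50)               # (batch, channels, time)
print(LightweightConv1d(128)(frames).shape)    # torch.Size([2, 128, 50])
```

Because the operation has no recurrence over time, a whole utterance can be processed in parallel, which is the source of the efficiency properties claimed above.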
Methods
- The authors used a proprietary speech dataset containing 405 hours of speech data: 347,872 utterances from 45 speakers in 3 English accents (32 US English, 8 British English, and 5 Australian English speakers).
- The Parallel Tacotron models were trained with Nesterov momentum optimization with a momentum of 0.99.
- All models were trained for 120K steps with global gradient norm clipping of 0.2 and a batch size of 2,048 using Google Cloud TPUs. Training took less than one day.
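The optimization settings listed above can be read as a conventional training configuration. The code below is a minimal, hypothetical PyTorch loop using the reported values (Nesterov momentum of 0.99, global gradient-norm clipping at 0.2, 120K steps); the model, data, learning rate, and small batch size are placeholders, not the paper's.

```python
import torch

# Stand-in model and toy data; the paper's acoustic model, learning-rate
# schedule, and batch size of 2,048 are not reproduced here.
model = torch.nn.Linear(80, 80)
optimizer = torch.optim.SGD(model.parameters(), lr=1e-3,      # lr is an assumption
                            momentum=0.99, nesterov=True)     # Nesterov momentum, 0.99

for step in range(120_000):                                   # 120K training steps
    inputs = torch.randn(16, 80)                              # toy batch
    targets = torch.randn(16, 80)
    loss = torch.nn.functional.l1_loss(model(inputs), targets)
    optimizer.zero_grad()
    loss.backward()
    torch.nn.utils.clip_grad_norm_(model.parameters(), 0.2)   # clip global grad norm to 0.2
    optimizer.step()
```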
Results
- The first experiment evaluated the effect of the iterative loss.
- Tables 1 and 2 show the experimental results.
- Although there was no significant difference in MOS or preference against Tacotron 2, the direct comparison between models with and without the iterative loss indicates that the iterative loss can give a small improvement (an illustrative sketch follows this list).
- The second experiment evaluated the impact of VAEs. Tables 5 and 6 show the experimental results.
- The introduction of a global VAE made Parallel Tacotron comparable to the baseline Tacotron 2 in both evaluations.
- The introduction of a fine-grained phoneme-level VAE further improved naturalness.
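As a rough illustration of the two ingredients evaluated in these experiments, the sketch below shows (a) a global VAE-style residual encoder that summarizes a target mel-spectrogram into a single latent vector with a KL penalty, and (b) an iterative mel-spectrogram loss that averages an L1 loss over the prediction made after each decoder block. The layer sizes, 128-bin mel assumption, and PyTorch framing are illustrative, not the paper's configuration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GlobalResidualEncoder(nn.Module):
    """Sketch of a global VAE-style residual encoder: the target
    mel-spectrogram is summarized into one latent vector with a diagonal
    Gaussian posterior. All sizes are illustrative assumptions."""
    def __init__(self, mel_bins=128, hidden=256, latent_dim=32):
        super().__init__()
        self.proj = nn.Linear(mel_bins, hidden)
        self.to_mean = nn.Linear(hidden, latent_dim)
        self.to_logvar = nn.Linear(hidden, latent_dim)

    def forward(self, mel):                              # (batch, frames, mel_bins)
        h = torch.tanh(self.proj(mel)).mean(dim=1)       # utterance-level summary
        mean, logvar = self.to_mean(h), self.to_logvar(h)
        z = mean + torch.exp(0.5 * logvar) * torch.randn_like(mean)  # reparameterization
        kl = -0.5 * (1 + logvar - mean.pow(2) - logvar.exp()).sum(-1).mean()
        return z, kl            # z conditions the decoder; kl joins the training loss


def iterative_mel_loss(per_block_predictions, target_mel):
    """Average an L1 spectrogram loss over the prediction made after every
    decoder block (the iterative loss compared in Tables 1 and 2)."""
    losses = [F.l1_loss(pred, target_mel) for pred in per_block_predictions]
    return torch.stack(losses).mean()
```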
Conclusion
- A non-autoregressive neural TTS model called Parallel Tacotron was proposed. It matched the baseline Tacotron 2 in naturalness and offered significantly faster inference than Tacotron 2.
- The authors showed that both variational residual encoders and an iterative loss improved the naturalness, and the use of lightweight convolutions as self-attention improved both naturalness and efficiency
Tables
- Table1: Subjective evaluations of Parallel Tacotron with and without the iterative loss. Positive preference scores indicate that the corresponding Parallel Tacotron model was rated better than the reference Tacotron 2
- Table2: Subjective preference scores between Parallel Tacotron with and without the iterative loss. Positive preference scores indicate that the corresponding model with the iterative loss was rated better than the one without the iterative loss
- Table3: Subjective evaluations of Parallel Tacotron with different self-attention (with Global VAE and iterative loss). Positive preference scores indicate that the corresponding Parallel Tacotron was rated better than Tacotron 2
- Table4: Subjective preference score between Parallel Tacotron using LConv and Transformer-based self-attention. Positive preference scores indicate that LConv was rated better than Transformer
- Table5: Subjective evaluations of Parallel Tacotron with different VAEs. Positive preference scores indicate that the corresponding Parallel Tacotron was rated better than Tacotron 2
- Table6: Subjective preference scores between Parallel Tacotron using the global and fine-grained VAEs. Positive preference scores indicate that left models were rated better than the right ones
- Table7: Subjective preference scores between synthetic and natural speech. Positive preference scores indicate that synthetic speech was rated better than natural speech
- Table8: Inference speed to predict mel-spectrograms for a ∼20-second utterance on a TPU (aggregated over ten trials)
Study subjects and analysis
speakers: 45
Training Setup. We used a proprietary speech dataset containing 405 hours of speech data; 347,872 utterances from 45 speakers in 3 English accents (32 US English, 8 British English, and 5 Australian English speakers). The Parallel Tacotron models were trained with Nesterov momentum optimization with a momentum of 0.99.
US English speakers: 10
These sentences were different from the training data and were used in previous papers. They were synthesized using 10 US English speakers (5 male and 5 female) in a round-robin style (100 sentences per speaker). The naturalness of the synthesized speech was evaluated through subjective listening tests, including 5-scale Mean Opinion Score (MOS) tests and side-by-side preference tests
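For context on the MOS and preference numbers reported in the tables, the sketch below shows one simple way such listener ratings could be aggregated. It assumes independent 1-5 MOS ratings and signed side-by-side ratings (positive means the first system was preferred); the actual rating scales and confidence-interval computation used in the paper may differ.

```python
import math
import statistics

def mos_with_ci(scores, z=1.96):
    """Mean Opinion Score with an approximate 95% confidence interval,
    assuming independent 1-5 ratings (illustrative, not the paper's exact setup)."""
    mean = statistics.mean(scores)
    half_width = z * statistics.stdev(scores) / math.sqrt(len(scores))
    return mean, half_width

def preference_score(signed_ratings):
    """Side-by-side preference: each rating is positive when the first system
    is preferred, negative otherwise; the mean gives a positive/negative score
    of the kind reported in Tables 1-7 (scale assumed here)."""
    return statistics.mean(signed_ratings)

print(mos_with_ci([5, 4, 4, 5, 3, 4, 5, 4]))    # (4.25, ~0.49)
print(preference_score([1, 0, 2, -1, 0, 1]))    # 0.5
```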
Reference
- J. Sotelo, S. Mehri, K. Kumar, J. F. Santos, K. Kastner, A. C. Courville, and Y. Bengio, “Char2Wav: End-to-End Speech Synthesis,” in Proc. ICLR, 2017.
- Y. Wang, RJ Skerry-Ryan, D. Stanton, Y. Wu, R. J. Weiss, N. Jaitly, Z. Yang, Y. Xiao, Z. Chen, S. Bengio, Q. Le, Y. Agiomyrgiannakis, R. Clark, and R. A. Saurous, “Tacotron: Towards End-to-End Speech Synthesis,” in Proc. Interspeech, 2017, pp. 4006–4010.
- W. Ping, K. Peng, A. Gibiansky, S. O. Arik, A. Kannan, S. Narang, J. Raiman, and J. Miller, “Deep Voice 3: 2000-Speaker Neural Text-to-Speech,” in Proc. ICLR, 2018.
- N. Li, S. Liu, Y. Liu, S. Zhao, and M. Liu, “Neural Speech Synthesis with Transformer Network,” in Proc. AAAI, 2019, vol. 33, pp. 6706–6713.
- A. van den Oord, S. Dieleman, H. Zen, K. Simonyan, O. Vinyals, A. Graves, N. Kalchbrenner, A. Senior, and K. Kavukcuoglu, “WaveNet: A Generative Model for Raw Audio,” arXiv:1609.03499, 2016.
- J. Shen, R. Pang, R. J. Weiss, M. Schuster, N. Jaitly, Z. Yang, Z. Chen, Y. Zhang, Y. Wang, RJ Skerry-Ryan, R. A. Saurous, Y. Agiomyrgiannakis, and Y. Wu, “Natural TTS Synthesis by Conditioning WaveNet on Mel Spectrogram Predictions,” in Proc. ICASSP, 2018.
- D. Bahdanau, K. Cho, and Y. Bengio, “Neural Machine Translation by Jointly Learning to Align and Translate,” in Proc. ICLR, 2015.
- R. J. Williams and D. Zipser, “A Learning Algorithm for Continually Running Fully Recurrent Neural Networks,” Neural Computation, vol. 1, no. 2, pp. 270–280, 1989.
- S. Bengio, O. Vinyals, N. Jaitly, and N. Shazeer, “Scheduled Sampling for Sequence Prediction with Recurrent Neural Networks,” in Proc. NIPS, 2015, pp. 1171–1179.
- A. Goyal, A. Lamb, Y. Zhang, S. Zhang, A. Courville, and Y. Bengio, “Professor Forcing: A New Algorithm for Training Recurrent Networks,” in Proc. NIPS, 2016, pp. 4601–4609.
- M. He, Y. Deng, and L. He, “Robust Sequence-to-Sequence Acoustic Modeling with Stepwise Monotonic Attention for Neural TTS,” in Proc. Interspeech, 2019, pp. 1293–1297.
- Y. Zheng, J. Tao, Z. Wen, and J. Yi, “Forward–Backward Decoding Sequence for Regularizing End-to-End TTS,” IEEE/ACM Trans. Audio Speech & Lang. Process., vol. 27, no. 12, pp. 2067–2079, 2019.
- H. Guo, F. K. Soong, L. He, and L. Xie, “A New GAN-Based End-to-End TTS Training Algorithm,” in Proc. Interspeech, 2019, pp. 1288–1292.
- E. Battenberg, RJ Skerry-Ryan, S. Mariooryad, D. Stanton, D. Kao, M. Shannon, and T. Bagby, “Location-relative attention mechanisms for robust long-form speech synthesis,” in Proc. ICASSP, 2020, pp. 6194–6198.
- Y. Ren, Y. Ruan, X. Tan, T. Qin, S. Zhao, Z. Zhao, and T.-Y. Liu, “FastSpeech: Fast, Robust and Controllable Text to Speech,” arXiv:1905.09263, 2019.
- Y. Ren, C. Hu, X. Tan, T. Qin, S. Zhao, Z. Zhao, and T.-Y. Liu, “FastSpeech 2: Fast and High-Quality End-to-End Text to Speech,” arXiv:2006.04558, 2020.
- S. Beliaev, Y. Rebryk, and B. Ginsburg, “TalkNet: Fully-Convolutional Non-Autoregressive Speech Synthesis Model,” arXiv:2005.05514, 2020.
- D. Lim, W. Jang, H. Park, B. Kim, and J. Yoon, “JDI-T: Jointly trained Duration Informed Transformer for Text-To-Speech without Explicit Alignment,” arXiv:2005.07799, 2020.
- Z. Zeng, J. Wang, N. Cheng, T. Xia, and J. Xiao, “AlignTTS: Efficient Feed-Forward Text-to-Speech System without Explicit Alignment,” in Proc. ICASSP, 2020, pp. 6714–6718.
- C. Miao, S. Liang, M. Chen, J. Ma, S. Wang, and J. Xiao, “Flow-TTS: A Non-Autoregressive Network for Text to Speech Based on Flow,” in Proc. ICASSP, 2020, pp. 7209–7213.
- J. Donahue, S. Dieleman, M. Bińkowski, E. Elsen, and K. Simonyan, “End-to-End Adversarial Text-to-Speech,” arXiv:2006.03575, 2020.
- A. Lańcucki, “FastPitch: Parallel Text-to-speech with Pitch Prediction,” arXiv:2006.06873, 2020.
- C. Yu, H. Lu, N. Hu, M. Yu, C. Weng, K. Xu, P. Liu, D. Tuo, K. Kang, G. Lei, D. Su, and D. Yu, “DurIAN: Duration informed attention network for multimodal synthesis,” arXiv:1909.01700, 2019.
- J. Shen, Y. Jia, M. Chrzanowski, Y. Zhang, I. Elias, H. Zen, and Y. Wu, “Non-Attentive Tacotron: Robust and Controllable Neural TTS Synthesis Including Unsupervised Duration Modeling,” arXiv:2010.04301, 2020.
- H. Zen, K. Tokuda, and A. Black, “Statistical Parametric Speech Synthesis,” Speech Communication, vol. 51, no. 11, pp. 1039–1064, 2009.
- H. Zen, A. Senior, and M. Schuster, “Statistical Parametric Speech Synthesis Using Deep Neural Networks,” in Proc. ICASSP, 2013, pp. 7962–7966.
- Y. Wang, D. Stanton, Y. Zhang, RJ Skerry-Ryan, E. Battenberg, J. Shor, Y. Xiao, Y. Jia, F. Ren, and R. A. Saurous, “Style Tokens: Unsupervised Style Modeling, Control and Transfer in End-to-End Speech Synthesis,” in Proc. ICML, 2018, pp. 5167–5176.
- W.-N. Hsu, Y. Zhang, R. J. Weiss, H. Zen, Y. Wu, Y. Wang, Y. Cao, Y. Jia, Z. Chen, J. Shen, P. Nguyen, and R. Pang, “Hierarchical Generative Modeling for Controllable Speech Synthesis,” in Proc. ICLR, 2019.
- Y. Zhang, R. J. Weiss, H. Zen, Wu Y., Z. Chen, RJ Skerry-Ryan, Y. Jia, A. Rosenberg, and B. Ramabhadran, “Learning to Speak Fluently in a Foreign Language: Multilingual Speech Synthesis and Cross-Language Voice Cloning,” in Proc. Interspeech, 2019, pp. 2080–2084.
- G. Sun, Y. Zhang, R. J. Weiss, Y. Cao, H. Zen, and Y. Wu, “Fully-Hierarchical Fine-Grained Prosody Modeling for Interpretable Speech Synthesis,” arXiv:2002.03785, 2020.
- F. Wu, A. Fan, A. Baevski, Y. N. Dauphin, and M. Auli, “Pay Less Attention with Lightweight and Dynamic Convolutions,” in Proc. ICLR, 2019.
- A. Tjandra, C. Liu, F. Zhang, X. Zhang, Y. Wang, G. Synnaeve, S. Nakamura, and G. Zweig, “DEJA-VU: Double Feature Presentation and Iterated Loss in Deep Transformer Networks,” in Proc. ICASSP, 2020, pp. 6899–6903.
- J. Gu, J. Bradbury, C. Xiong, V. O. K. Li, and R. Socher, “Non-Autoregressive Neural Machine Translation,” arXiv:1711.02281, 2017.
- J. Lee, E. Mansimov, and K. Cho, “Deterministic Non-Autoregressive Neural Sequence Modeling by Iterative Refinement,” arXiv:1802.06901, 2018.
- R. Shu, J. Lee, H. Nakayama, and K. Cho, “Latent-Variable Non-Autoregressive Neural Machine Translation with Deterministic Inference Using a Delta Posterior,” in Proc. AAAI, 2020.
- J. L. Ba, J. R. Kiros, and G. E. Hinton, “Layer Normalization,” arXiv:1607.06450, 2016.
- D. Talkin and C. W. Wightman, “The Aligner: Text to Speech Alignment using Markov Models and a Pronunciation Dictionary,” in ESCA/IEEE SSW2, 1994.
- A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, L. Kaiser, and I. Polosukhin, “Attention Is All You Need,” in Proc. NeurIPS, 2017.
- R. Liu, J. Lehman, P. Molino, F. P. Such, E. Frank, A. Sergeev, and J. Yosinski, “An Intriguing Failing of Convolutional Neural Networks and the CoordConv Solution,” in Proc. NeurIPS, 2018, pp. 9605–9616.
- A. Graves, “Generating Sequences with Recurrent Neural Networks,” arXiv:1308.0850, 2013.
- RJ Skerry-Ryan, E. Battenberg, Y. Xiao, Y. Wang, D. Stanton, J. Shor, R. Weiss, R. Clark, and R. A. Saurous, “Towards End-to-End Prosody Transfer for Expressive Speech Synthesis with Tacotron,” in Proc. ICML, 2018.
- N. Kalchbrenner, E. Elsen, K. Simonyan, S. Noury, N. Casagrande, E. Lockhart, F. Stimberg, A. van den Oord, S. Dieleman, and K. Kavukcuoglu, “Efficient Neural Audio Synthesis,” in Proc. ICML, 2018, pp. 2410–2419.