Non-Attentive Tacotron: Robust and Controllable Neural TTS Synthesis Including Unsupervised Duration Modeling

Jonathan Shen, Ye Jia, Mike Chrzanowski, Isaac Elias
Keywords: automatic speech recognition, word error rate, unaligned duration ratio, Gaussian mixture model, duration predictor

Abstract:

This paper presents Non-Attentive Tacotron, based on the Tacotron 2 text-to-speech model, replacing the attention mechanism with an explicit duration predictor. This significantly improves robustness as measured by unaligned duration ratio and word deletion rate, two metrics introduced in this paper for large-scale robustness evaluation ...

Introduction
  • Model-based text-to-speech (TTS) synthesis has evolved from hidden Markov model (HMM)-based approaches (Zen et al., 2009) to modern deep neural network-based ones.
  • As the general focus turned towards end-to-end approaches, the sequence-to-sequence model with an attention mechanism used in neural machine translation (NMT) (Bahdanau et al., 2015) and automatic speech recognition (ASR) (Chan et al., 2016) became an attractive option, removing the need to represent durations explicitly.
  • This led to works such as Char2Wav (Sotelo et al., 2017), Tacotron (Wang et al., 2017; Shen et al., 2018), Deep Voice 3 (Ping et al., 2018), and Transformer TTS (Li et al., 2019), each an autoregressive network that predicts the output one frame at a time.
  • Similar models have been used for more complicated problems, such as direct speech-to-speech translation (Jia et al., 2019), speech conversion (Biadsy et al., 2019), and speech enhancement (Ding et al., 2020).
Highlights
  • In the past decade, model-based text-to-speech (TTS) synthesis has evolved from hidden Markov model (HMM)-based approaches (Zen et al., 2009) to modern deep neural network-based ones.
  • As the general focus turned towards end-to-end approaches, the sequence-to-sequence model with an attention mechanism used in neural machine translation (NMT) (Bahdanau et al., 2015) and automatic speech recognition (ASR) (Chan et al., 2016) became an attractive option, removing the need to represent durations explicitly.
  • We evaluated the robustness of the neural TTS models by measuring unaligned duration ratio (UDR) and word deletion rate (WDR) on two large evaluation sets: LibriTTS, 354K sentences from all train subsets of the LibriTTS corpus (Zen et al., 2019); and web-long, 100K long sentences mined from the web, which included a small amount of irregular text such as programming code (a rough sketch of how WDR can be computed follows this list).
  • This paper presented Non-Attentive Tacotron, demonstrating a significant improvement in robustness compared to Tacotron 2 as measured by unaligned duration ratio and word deletion rate, while slightly outperforming it in naturalness.
  • We showed the ability to control the pacing of the entire utterance as well as of individual words using the duration predictor.
  • We demonstrated a method for modeling duration in a semi-supervised or unsupervised manner within Non-Attentive Tacotron, when accurate target durations are scarce or unavailable, by using a fine-grained variational auto-encoder, with results almost as good as supervised training.
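As a rough illustration of the WDR metric referenced above, the sketch below aligns each input transcript against an ASR transcript of the corresponding synthesized audio and counts reference words that end up deleted. It is a minimal sketch under our own assumptions (the ASR step is performed elsewhere, and the alignment tie-breaking rule is illustrative), not the paper's implementation; UDR, which additionally requires frame-level alignment of the audio, is not shown.

```python
# Minimal sketch (not the paper's exact implementation): estimate word deletion
# rate (WDR) by aligning input transcripts against ASR transcripts of the
# synthesized audio and counting unmatched reference words.

def word_deletions(reference: list, hypothesis: list) -> int:
    """Deletions in a minimum-edit-distance alignment (ties broken toward
    fewer deletions, which is an illustrative choice)."""
    m, n = len(reference), len(hypothesis)
    # dp[i][j] = (edit distance, deletions) for reference[:i] vs hypothesis[:j].
    dp = [[(0, 0)] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        dp[i][0] = (i, i)          # every reference word deleted
    for j in range(1, n + 1):
        dp[0][j] = (j, 0)          # every hypothesis word inserted
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            sub = 0 if reference[i - 1] == hypothesis[j - 1] else 1
            dp[i][j] = min(
                (dp[i - 1][j - 1][0] + sub, dp[i - 1][j - 1][1]),  # match / substitution
                (dp[i - 1][j][0] + 1, dp[i - 1][j][1] + 1),        # deletion
                (dp[i][j - 1][0] + 1, dp[i][j - 1][1]),            # insertion
            )
    return dp[m][n][1]


def word_deletion_rate(input_texts, asr_transcripts) -> float:
    """WDR = deleted reference words / total reference words over a corpus."""
    deleted = total = 0
    for ref, hyp in zip(input_texts, asr_transcripts):
        ref_words, hyp_words = ref.lower().split(), hyp.lower().split()
        deleted += word_deletions(ref_words, hyp_words)
        total += len(ref_words)
    return deleted / max(total, 1)
```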
Methods
  • All models were trained on a proprietary dataset of 66 speakers with 4 different English accents (US, British, Australian, and Nigerian).
  • A preliminary experiment comparing different attention mechanisms (including monotonic, stepwise monotonic, dynamic convolution, and GMM attention (GMMA)) showed that GMMA performed the best (a generic sketch of this attention family follows this list).
  • The authors compared Non-Attentive Tacotron against Tacotron 2 with GMMA as well as with location-sensitive attention (LSA), which was used in the original model.
  • The systems compared were Tacotron 2 with LSA, Tacotron 2 with GMMA, and Non-Attentive Tacotron with Gaussian upsampling.
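For context on the GMMA baseline mentioned above, here is a minimal NumPy sketch of one decoder step of Graves-style GMM attention, the family GMMA belongs to. The parameter constraints (softmax weights, softplus steps and widths) are a common choice and an assumption on our part, not necessarily the exact variant evaluated in the paper; all names are illustrative.

```python
import numpy as np

def softplus(x):
    return np.log1p(np.exp(x))

def gmm_attention_step(raw_params, prev_means, memory):
    """One decoder step of Graves-style GMM attention (illustrative only).

    raw_params: (K, 3) unconstrained mixture parameters predicted from the
                decoder state; columns = (weight, step, width) logits.
    prev_means: (K,) mixture means from the previous decoder step.
    memory:     (T, D) encoder outputs.
    Returns the context vector (D,), alignment weights (T,), and new means (K,).
    """
    raw_w, raw_step, raw_width = raw_params[:, 0], raw_params[:, 1], raw_params[:, 2]
    weights = np.exp(raw_w - raw_w.max())
    weights /= weights.sum()                      # mixture weights sum to 1
    steps = softplus(raw_step)                    # non-negative forward movement
    widths = softplus(raw_width) + 1e-5           # strictly positive scales
    means = prev_means + steps                    # means only move forward -> monotonic

    positions = np.arange(memory.shape[0])[:, None]          # (T, 1)
    pdf = np.exp(-0.5 * ((positions - means) / widths) ** 2)
    pdf /= widths * np.sqrt(2.0 * np.pi)                     # Gaussian densities, (T, K)
    alignment = pdf @ weights                                 # (T,)
    context = alignment @ memory                              # (D,)
    return context, alignment, means
```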
Conclusion
  • This paper presented Non-Attentive Tacotron, demonstrating a significant improvement in robustness compared to Tacotron 2 as measured by unaligned duration ratio and word deletion rate, while slightly outperforming it in naturalness.
  • This was achieved by replacing the attention mechanism in Tacotron 2 with an explicit duration predictor and Gaussian upsampling (a sketch of the upsampling step follows this list).
  • The authors demonstrated a method for modeling duration in a semi-supervised or unsupervised manner within Non-Attentive Tacotron, when accurate target durations are scarce or unavailable, by using a fine-grained variational auto-encoder, with results almost as good as supervised training.
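As a rough sketch of the Gaussian upsampling step mentioned above: each output frame is a weighted average of the per-token encoder outputs, with weights given by normalized Gaussian densities centered at each token's cumulative-duration midpoint and scaled by a predicted per-token range parameter. The NumPy sketch below illustrates this idea under our own simplifications (frame indexing and rounding are assumptions), not the paper's exact implementation.

```python
import numpy as np

def gaussian_upsample(encodings, durations, ranges):
    """Sketch of Gaussian upsampling: expand per-token encodings to frames.

    encodings: (N, D) per-token encoder outputs.
    durations: (N,)   predicted durations in frames (may be fractional).
    ranges:    (N,)   predicted per-token standard deviations, all > 0.
    Returns a (T, D) frame-level sequence with T = round(sum(durations)).
    """
    durations = np.asarray(durations, dtype=float)
    ranges = np.asarray(ranges, dtype=float)
    ends = np.cumsum(durations)
    centers = ends - 0.5 * durations              # token centers on the frame axis
    num_frames = int(round(ends[-1]))

    t = np.arange(num_frames, dtype=float)[:, None]                  # (T, 1)
    logits = -0.5 * ((t - centers) / ranges) ** 2 - np.log(ranges)   # log Gaussian, (T, N)
    weights = np.exp(logits - logits.max(axis=1, keepdims=True))
    weights /= weights.sum(axis=1, keepdims=True)                    # normalize over tokens
    return weights @ encodings                                       # (T, D)
```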
Tables
  • Table1: MOS with 95% confidence intervals
  • Table2: Robustness measured by UDR and WDR on two large evaluation sets
  • Table3: Performance of controlling the utterance-wide pace of the synthesized speech
  • Table4: Performance of unsupervised and semi-supervised duration modeling. Zero vectors are used as FVAE latents for inference; MAE denotes the mean absolute error (a hedged sketch of a masked duration loss follows this table list)
  • Table5: Model parameters
  • Table6: WER breakdowns in the robustness evaluation. Deletion rate (del) is the WDR in Table 2
  • Table7: Classification of some TTS models into autoregressive (AR)/feed-forward (FF), RNN/Transformer/fully convolutional, and attention-based/duration-based
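Table 4 evaluates semi-supervised duration modeling, where duration labels are available for only part of the training data. A minimal sketch of how such a setup can mask the duration loss is shown below; it is an assumption-laden illustration with hypothetical names, and the paper's actual objective and the way FVAE-derived durations enter training may differ.

```python
import numpy as np

def masked_duration_loss(predicted, targets, has_duration_labels):
    """Sketch of a semi-supervised duration objective (an assumption, not the
    paper's exact loss): labeled utterances incur an L2 penalty against their
    target durations, unlabeled utterances contribute nothing (their durations
    would instead be derived from the FVAE alignment).

    predicted:           (B, N) predicted per-token durations.
    targets:             (B, N) target durations (ignored where unlabeled).
    has_duration_labels: (B,)   1.0 for labeled utterances, 0.0 otherwise.
    """
    per_utterance = np.mean((predicted - targets) ** 2, axis=1)      # (B,)
    mask = np.asarray(has_duration_labels, dtype=float)              # (B,)
    return float((per_utterance * mask).sum() / max(mask.sum(), 1.0))
```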
Study subjects and analysis
  • Training data (66 speakers, 4 English accents): All models were trained on a proprietary dataset of 66 speakers with 4 different English accents (US, British, Australian, and Nigerian). The amount of data per speaker varied from merely 5 seconds to 47 hours, totalling 354 hours. When accurate target durations are scarce or unavailable in the training data, the authors propose a method using a fine-grained variational auto-encoder to train the duration predictor in a semi-supervised or unsupervised manner, with results almost as good as supervised training.
  • Naturalness evaluation (10 US English speakers): Naturalness of the synthesized speech was evaluated through subjective listening tests, including 5-scale Mean Opinion Score (MOS) tests and side-by-side preference tests. Sentences were synthesized using 10 US English speakers (5 male / 5 female) in a round-robin style. The amount of training data for the evaluated speakers varied from 3 hours to 47 hours.
  • Robustness evaluation (same 10 speakers): Because the ASR system used for scoring makes mistakes, UDR and WDR are only an upper bound on the actual failures of the TTS system. The median text lengths of the two evaluation sets were 74 and 224 characters, respectively. The input was synthesized with the same 10 speakers as in subsection 5.1 in round-robin style, and all model outputs were capped at 120 seconds.
  • Pace control (same 10 speakers): Table 3 shows WER and MOS results after modifying the utterance-wide pace by dividing the predicted durations by various factors (a minimal sketch of this duration scaling follows this list). The WER is computed on speech synthesized from transcripts of the LibriTTS test-clean subset with the same 10 speakers as in subsection 5.1, then transcribed by the ASR model described in Park et al. (2020), which has a WER of 2.3% on the ground-truth audio.
  • Duration modeling (10 speakers): Ten different US English speakers (5 male / 5 female), each with about 4 hours of training data, were used to evaluate the unsupervised and semi-supervised duration modeling. The duration labels for these 10 speakers were withheld for the semi-supervised models, and all duration labels were withheld for the unsupervised models. Figure 4 compares the predicted alignment after Gaussian upsampling and the internal alignment from the attention module in the FVAE against the alignment computed from the target durations, for the unsupervised model.
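As referenced in the pace-control item above, here is a minimal sketch of utterance-wide and per-word pace adjustment by rescaling the predicted durations before upsampling. The per-word override and all names are illustrative assumptions; gaussian_upsample refers to the sketch given after the conclusion.

```python
import numpy as np

def apply_pace(durations, pace_factor, word_ids=None, word_factors=None):
    """Sketch of pace control by rescaling predicted durations before
    Gaussian upsampling (the per-word override is illustrative).

    durations:    (N,) predicted per-token durations in frames.
    pace_factor:  utterance-wide divisor; > 1 shortens durations (faster speech).
    word_ids:     optional (N,) word index for each token, for per-word control.
    word_factors: optional {word_index: divisor} overriding individual words.
    """
    durations = np.asarray(durations, dtype=float)
    scaled = durations / pace_factor
    if word_ids is not None and word_factors:
        word_ids = np.asarray(word_ids)
        for word, factor in word_factors.items():
            scaled[word_ids == word] = durations[word_ids == word] / factor
    return scaled

# Usage (hypothetical names): speed the utterance up by 10%, slow word 3 to half pace.
# scaled = apply_pace(pred_durations, pace_factor=1.1,
#                     word_ids=token_word_ids, word_factors={3: 0.5})
# frames = gaussian_upsample(encodings, scaled, pred_ranges)
```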

References
  • Sercan O Arik, Mike Chrzanowski, Adam Coates, Gregory Diamos, Andrew Gibiansky, Yongguo Kang, Xian Li, John Miller, Andrew Ng, Jonathan Raiman, Shubho Sengupta, and Mohammad Shoeybi. Deep Voice: Real-Time Neural Text-to-Speech. In Proc. ICML, pp. 195–204, 2017.
  • Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. Neural Machine Translation by Jointly Learning to Align and Translate. In Proc. ICLR, 2015.
  • Eric Battenberg, RJ Skerry-Ryan, Soroosh Mariooryad, Daisy Stanton, David Kao, Matt Shannon, and Tom Bagby. Location-Relative Attention Mechanisms For Robust Long-Form Speech Synthesis. In Proc. ICASSP, 2020.
  • Stanislav Beliaev, Yurii Rebryk, and Boris Ginsburg. TalkNet: Fully-Convolutional Non-Autoregressive Speech Synthesis Model. arXiv preprint arXiv:2005.05514, 2020.
  • Fadi Biadsy, Ron J. Weiss, Pedro J. Moreno, Dimitri Kanevsky, and Ye Jia. Parrotron: An End-to-End Speech-to-Speech Conversion Model and its Applications to Hearing-Impaired Speech and Speech Separation. In Proc. Interspeech, pp. 4115–4119, 2019.
  • W. Chan, N. Jaitly, Q. Le, and O. Vinyals. Listen, Attend and Spell: A Neural Network for Large Vocabulary Conversational Speech Recognition. In Proc. ICASSP, pp. 4960–4964, 2016.
  • Nanxin Chen, Yu Zhang, Heiga Zen, Ron J Weiss, Mohammad Norouzi, and William Chan. WaveGrad: Estimating Gradients for Waveform Generation. arXiv preprint arXiv:2009.00713, 2020.
  • Chung-Cheng Chiu, Anshuman Tripathi, Katherine Chou, Chris Co, Navdeep Jaitly, Diana Jaunzeikare, Anjuli Kannan, Patrick Nguyen, Hasim Sak, Ananth Sankar, Justin Tansuwan, Nathan Wan, Yonghui Wu, and Xuedong Zhang. Speech Recognition for Medical Conversations. In Proc. Interspeech, pp. 2972–2976, 2018.
  • Shaojin Ding, Ye Jia, Ke Hu, and Quan Wang. Textual Echo Cancellation. arXiv preprint arXiv:2008.06006, 2020.
  • Jeff Donahue, Sander Dieleman, Mikołaj Binkowski, Erich Elsen, and Karen Simonyan. End-to-End Adversarial Text-to-Speech. arXiv preprint arXiv:2006.03575, 2020.
  • Andrew Gibiansky, Sercan Arik, Gregory Diamos, John Miller, Kainan Peng, Wei Ping, Jonathan Raiman, and Yanqi Zhou. Deep Voice 2: Multi-Speaker Neural Text-to-Speech. In Proc. NIPS, pp. 2962–2970, 2017.
  • Alex Graves. Generating Sequences with Recurrent Neural Networks. arXiv preprint arXiv:1308.0850, 2013.
  • Jiatao Gu, James Bradbury, Caiming Xiong, Victor OK Li, and Richard Socher. Non-autoregressive neural machine translation. arXiv preprint arXiv:1711.02281, 2017.
  • Haohan Guo, Frank K Soong, Lei He, and Lei Xie. A New GAN-based End-to-End TTS Training Algorithm. In Proc. Interspeech, pp. 1288–1292, 2019.
  • Mutian He, Yan Deng, and Lei He. Robust Sequence-to-Sequence Acoustic Modeling with Stepwise Monotonic Attention for Neural TTS. In Proc. Interspeech, pp. 1293–1297, 2019.
  • Ye Jia, Yu Zhang, Ron J. Weiss, Quan Wang, Jonathan Shen, Fei Ren, Zhifeng Chen, Patrick Nguyen, Ruoming Pang, Ignacio Lopez-Moreno, and Yonghui Wu. Transfer Learning from Speaker Verification to Multispeaker Text-To-Speech Synthesis. In Proc. NeurIPS, 2018.
  • Ye Jia, Ron J Weiss, Fadi Biadsy, Wolfgang Macherey, Melvin Johnson, Zhifeng Chen, and Yonghui Wu. Direct Speech-to-Speech Translation with a Sequence-to-Sequence Model. In Proc. Interspeech, pp. 1123–1127, 2019.
  • Jacob Kahn, Morgane Riviere, Weiyi Zheng, Evgeny Kharitonov, Qiantong Xu, Pierre-Emmanuel Mazare, Julien Karadayi, Vitaliy Liptchinsky, Ronan Collobert, Christian Fuegen, Tatiana Likhomanenko, Gabriel Synnaeve, Armand Joulin, Abdelrahman Mohamed, and Emmanuel Dupoux. Librilight: A Benchmark for ASR with Limited or No Supervision. In Proc. ICASSP, pp. 7669–7673, 2020.
  • Nal Kalchbrenner, Erich Elsen, Karen Simonyan, Seb Noury, Norman Casagrande, Edward Lockhart, Florian Stimberg, Aaron van den Oord, Sander Dieleman, and Koray Kavukcuoglu. Efficient Neural Audio Synthesis. In Proc. ICML, pp. 2410–2419, 2018.
  • Tom Kenter, Vincent Wan, Chun-An Chan, Rob Clark, and Jakub Vit. CHiVE: Varying Prosody in Speech Synthesis with a Linguistically Driven Dynamic Hierarchical Conditional Variational Network. In Proc. ICML, pp. 3331–3340, 2019.
  • Diederik P Kingma and Max Welling. Auto-Encoding Variational Bayes. In Proc. ICLR, 2014.
  • Kundan Kumar, Rithesh Kumar, Thibault de Boissiere, Lucas Gestin, Wei Zhen Teoh, Jose Sotelo, Alexandre de Brebisson, Yoshua Bengio, and Aaron Courville. MelGAN: Generative Adversarial Networks for Conditional Waveform Synthesis. In Proc. NeurIPS, 2019.
  • Younggun Lee and Taesu Kim. Robust and fine-grained prosody control of end-to-end speech synthesis. In Proc. ICASSP, pp. 5911–5915, 2019.
  • Naihan Li, Shujie Liu, Yanqing Liu, Sheng Zhao, and Ming Liu. Neural Speech Synthesis with Transformer Network. In Proc. AAAI, volume 33, pp. 6706–6713, 2019.
  • Dan Lim, Won Jang, Hyeyeong Park, Bongwan Kim, and Jesam Yoon. JDI-T: Jointly Trained Duration Informed Transformer for Text-To-Speech without Explicit Alignment. arXiv preprint arXiv:2005.07799, 2020.
  • Aaron van den Oord, Sander Dieleman, Heiga Zen, Karen Simonyan, Oriol Vinyals, Alex Graves, Nal Kalchbrenner, Andrew Senior, and Koray Kavukcuoglu. WaveNet: A Generative Model for Raw Audio. arXiv preprint arXiv:1609.03499, 2016.
  • Vassil Panayotov, Guoguo Chen, Daniel Povey, and Sanjeev Khudanpur. Librispeech: An ASR Corpus Based on Public Domain Audio Books. In Proc. ICASSP, pp. 5206–5210, 2015.
  • Daniel S Park, Yu Zhang, Ye Jia, Wei Han, Chung-Cheng Chiu, Bo Li, Yonghui Wu, and Quoc V Le. Improved Noisy Student Training for Automatic Speech Recognition. In Proc. Interspeech, 2020. to appear.
  • Wei Ping, Kainan Peng, Andrew Gibiansky, Sercan O Arik, Ajay Kannan, Sharan Narang, Jonathan Raiman, and John Miller. Deep Voice 3: Scaling text-to-speech with convolutional sequence learning. In Proc. ICLR, 2018.
  • Ryan Prenger, Rafael Valle, and Bryan Catanzaro. WaveGlow: A Flow-based Generative Network for Speech Synthesis. In Proc. ICASSP, 2019.
  • Yi Ren, Yangjun Ruan, Xu Tan, Tao Qin, Sheng Zhao, Zhou Zhao, and Tie-Yan Liu. FastSpeech: Fast, Robust and Controllable Text to Speech. In Proc. NeurIPS, 2019.
  • Yi Ren, Chenxu Hu, Tao Qin, Sheng Zhao, Zhou Zhao, and Tie-Yan Liu. FastSpeech 2: Fast and High-Quality End-to-End Text-to-Speech. arXiv preprint arXiv:2006.04558, 2020.
  • Jonathan Shen, Ruoming Pang, Ron J. Weiss, Mike Schuster, Navdeep Jaitly, Zongheng Yang, Zhifeng Chen, Yu Zhang, Yuxuan Wang, RJ Skerrv-Ryan, Rif A. Saurous, Yannis Agiomyrgiannakis, and Yonghui Wu. Natural TTS Synthesis by Conditioning WaveNet on Mel Spectrogram Predictions. In Proc. ICASSP, pp. 4779–4783, 2018.
  • RJ Skerry-Ryan, Eric Battenberg, Ying Xiao, Yuxuan Wang, Daisy Stanton, Joel Shor, Ron Weiss, Rob Clark, and Rif A Saurous. Towards End-to-End Prosody Transfer for Expressive Speech Synthesis with Tacotron. In Proc. ICML, pp. 4700–4709, 2018.
  • Jose Sotelo, Soroush Mehri, Kundan Kumar, Joao Felipe Santos, Kyle Kastner, Aaron C. Courville, and Yoshua Bengio. Char2Wav: End-to-End speech synthesis. In Proc. ICLR workshop, 2017.
  • Guangzhi Sun, Yu Zhang, Ron J Weiss, Yuan Cao, Heiga Zen, and Yonghui Wu. Fully-Hierarchical Fine-Grained Prosody Modeling for Interpretable Speech Synthesis. In Proc. ICASSP, pp. 6264– 6268, 2020.
  • David Talkin and Colin W Wightman. The aligner: Text to speech alignment using markov models and a pronunciation dictionary. In The Second ESCA/IEEE Workshop on Speech Synthesis, 1994.
  • Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, and Illia Polosukhin. Attention Is All You Need. In Proc. NIPS, 2017.
  • Yuxuan Wang, RJ Skerry-Ryan, Daisy Stanton, Yonghui Wu, Ron J Weiss, Navdeep Jaitly, Zongheng Yang, Ying Xiao, Zhifeng Chen, Samy Bengio, Q. Le, Y. Agiomyrgiannakis, R. Clark, and R. Saurous. Tacotron: Towards End-to-End Speech Synthesis. In Proc. Interspeech, pp. 4006– 4010, 2017.
  • Ronald J. Williams and David Zipser. A Learning Algorithm for Continually Running Fully Recurrent Neural Networks. Neural Computation, 1(2):270–280, 1989.
  • Chengzhu Yu, Heng Lu, Na Hu, Meng Yu, Chao Weng, Kun Xu, Peng Liu, Deyi Tuo, Shiyin Kang, Guangzhi Lei, Dan Su, and Dong Yu. DurIAN: Duration informed attention network for multimodal synthesis. arXiv:1909.01700, 2019.
  • H. Zen, K. Tokuda, and A. Black. Statistical Parametric Speech Synthesis. Speech Communication, 51(11):1039–1064, 2009.
  • Heiga Zen, Andrew Senior, and Mike Schuster. Statistical Parametric Speech Synthesis Using Deep Neural Networks. In Proc. ICASSP, pp. 7962–7966, 2013.
  • Heiga Zen, Viet Dang, Rob Clark, Yu Zhang, Ron J Weiss, Ye Jia, Zhifeng Chen, and Yonghui Wu. LibriTTS: A Corpus Derived from LibriSpeech for Text-to-Speech. In Proc. Interspeech, pp. 1526–1530, 2019.
  • Zhen Zeng, Jianzong Wang, Ning Cheng, Tian Xia, and Jing Xiao. AlignTTS: Efficient Feed-Forward Text-to-Speech System without Explicit Alignment. In Proc. ICASSP, pp. 6714–6718, 2020.
  • Jing-Xuan Zhang, Zhen-Hua Ling, and Li-Rong Dai. Forward attention in sequence-to-sequence acoustic modeling for speech synthesis. In Proc. ICASSP, pp. 4789–4793, 2018.
  • Zewang Zhang, Qiao Tian, Heng Lu, Ling-Hui Chen, and Shan Liu. AdaDurIAN: Few-Shot Adaptation for Neural Text-to-Speech with DurIAN. arXiv preprint arXiv:2005.05642, 2020.
  • Yibin Zheng, Jianhua Tao, Zhengqi Wen, and Jiangyan Yi. Forward–Backward Decoding Sequence for Regularizing End-to-End TTS. IEEE/ACM Trans. Audio, Speech & Lang. Process., 27(12):2067–2079, 2019.