Tacotron: Towards End-to-End Speech Synthesis

    INTERSPEECH, pp. 4006-4010, 2017.

    Keywords: mean opinion score, brittle design choice, neural machine translation, speech synthesis, synthesis

    Abstract:

    A text-to-speech synthesis system typically consists of multiple stages, such as a text analysis frontend, an acoustic model and an audio synthesis module. Building these components often requires extensive domain expertise and may contain brittle design choices. In this paper, we present Tacotron, an end-to-end generative text-to-speech model that synthesizes speech directly from characters.

    Introduction
    • Modern text-to-speech (TTS) pipelines are complex (Taylor, 2009). For example, it is common for statistical parametric TTS to have a text frontend extracting various linguistic features, a duration model, an acoustic feature prediction model and a complex signal-processing-based vocoder (Zen et al, 2009; Agiomyrgiannakis, 2015).
    • The authors propose Tacotron, an end-to-end generative TTS model based on the sequence-to-sequence (seq2seq) (Sutskever et al, 2014) with attention paradigm (Bahdanau et al, 2014).
    • A vanilla seq2seq model does not work well for character-level inputs.
    • Figure 1 depicts the model, which includes an encoder, an attention-based decoder, and a post-processing net.
    Highlights
    • Modern text-to-speech (TTS) pipelines are complex (Taylor, 2009)
    • We propose Tacotron, an end-to-end generative TTS model based on the sequence-to-sequence (Sutskever et al, 2014) with attention paradigm (Bahdanau et al, 2014)
    • Since we do not use techniques such as scheduled sampling (Bengio et al, 2015), the dropout in the pre-net is critical for the model to generalize, as it provides a noise source to resolve the multiple modalities in the output distribution
    • We compare with a model in which the CBHG encoder is replaced by a 2-layer residual gated recurrent unit (GRU) encoder
    • We found that noisy alignment often leads to mispronunciations
    • Attention module, loss function, and Griffin-Lim-based waveform synthesizer are all ripe for improvement
    Results
    • At a high level, the model takes characters as input and produces spectrogram frames, which are then converted to waveforms.
    • The post-processing net uses a Conv1D bank with K=8 filter sets, each conv-k-128-ReLU (see Table 1; a sketch of this filter bank follows the table).
    • The authors use a bottleneck layer with dropout as the pre-net in this work, which helps convergence and improves generalization (a minimal pre-net sketch follows this list).
    • The authors concatenate the context vector and the attention RNN cell output to form the input to the decoder RNNs, and use a stack of GRUs with vertical residual connections (Wu et al, 2016) for the decoder (see the decoder sketch after this list).
    • While the authors could directly predict raw spectrogram, it’s a highly redundant representation for the purpose of learning alignment between speech signal and text.
    • The authors use a post-processing network to convert from the seq2seq target to waveform.
    • The authors use a simple fully-connected output layer to predict the decoder targets.
    • Predicting r frames at once divides the total number of decoder steps by r, which reduces model size, training time, and inference time (see the output-layer sketch after this list).
    • Since the authors do not use techniques such as scheduled sampling (Bengio et al, 2015), the dropout in the pre-net is critical for the model to generalize, as it provides a noise source to resolve the multiple modalities in the output distribution.
    • Since the authors use Griffin-Lim as the synthesizer, the post-processing net learns to predict spectral magnitude sampled on a linear-frequency scale.
    • The authors use a CBHG module for the post-processing net, though a simpler architecture likely works as well.
    • The authors use a simple L1 loss for both the seq2seq decoder and the post-processing net.
    • In one ablation, no pre-net or post-processing net is used, and the decoder directly predicts the linear-scale log magnitude spectrogram.
    • The authors compare with a model with the CBHG encoder replaced by a 2-layer residual GRU encoder.
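    The pre-net bullets above describe a small bottleneck whose dropout stays important because scheduled sampling is not used. Below is a minimal sketch of such a bottleneck in PyTorch (not the framework used by the authors); the layer sizes (256 and 128 units) and the 0.5 dropout rate are illustrative assumptions, not values quoted above.

```python
import torch
import torch.nn as nn

class PreNet(nn.Module):
    """Bottleneck pre-net: FC-ReLU-dropout applied twice.

    Dropout acts as a noise source at every step, which the summary above
    credits with helping the autoregressive decoder generalize without
    scheduled sampling. Layer sizes here are illustrative assumptions.
    """
    def __init__(self, in_dim=256, hidden_dims=(256, 128), dropout=0.5):
        super().__init__()
        dims = [in_dim] + list(hidden_dims)
        self.layers = nn.ModuleList(
            nn.Linear(d_in, d_out) for d_in, d_out in zip(dims[:-1], dims[1:])
        )
        self.dropout = nn.Dropout(p=dropout)

    def forward(self, x):
        for layer in self.layers:
            x = self.dropout(torch.relu(layer(x)))
        return x

# Example: a batch of 4 decoder inputs with 256 features each.
prenet = PreNet(in_dim=256)
out = prenet(torch.randn(4, 256))
print(out.shape)  # torch.Size([4, 128])
```

    Applying dropout after every layer keeps a noise source in the decoder input path, which is what the highlight above credits for generalization.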
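    The decoder bullet above (context vector concatenated with the attention RNN output, fed to a GRU stack with vertical residual connections) can likewise be sketched. This is a minimal PyTorch illustration under assumed sizes, not the authors' implementation.

```python
import torch
import torch.nn as nn

class ResidualGRUDecoder(nn.Module):
    """Stack of GRU cells with vertical residual connections:
    each layer's output is added to its input before feeding the next layer.
    All dimensions are illustrative assumptions."""
    def __init__(self, input_dim=512, hidden_dim=256, num_layers=2):
        super().__init__()
        self.input_proj = nn.Linear(input_dim, hidden_dim)
        self.grus = nn.ModuleList(
            nn.GRUCell(hidden_dim, hidden_dim) for _ in range(num_layers)
        )

    def step(self, context, attn_rnn_out, states):
        # Decoder input = concatenation of the attention context vector
        # and the attention-RNN cell output.
        x = self.input_proj(torch.cat([context, attn_rnn_out], dim=-1))
        new_states = []
        for gru, h in zip(self.grus, states):
            h_new = gru(x, h)
            new_states.append(h_new)
            x = x + h_new  # vertical residual connection
        return x, new_states

dec = ResidualGRUDecoder()
batch = 4
states = [torch.zeros(batch, 256) for _ in dec.grus]
out, states = dec.step(torch.randn(batch, 256), torch.randn(batch, 256), states)
print(out.shape)  # torch.Size([4, 256])
```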
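    The bullets on the fully-connected output layer and the reduction factor r can be made concrete with a short sketch: one linear layer emits r frames' worth of values per decoder step, and the result is reshaped back into a frame sequence. The 80-band target dimension, r = 2, and the 256-unit decoder state are illustrative assumptions.

```python
import torch
import torch.nn as nn

mel_dim, r = 80, 2            # illustrative: 80 target bands, reduction factor 2
decoder_steps, batch = 50, 4

# A plain fully-connected output layer predicts r frames (r * mel_dim values)
# at every decoder step, so a 100-frame target needs only 50 decoder steps.
output_layer = nn.Linear(256, mel_dim * r)

decoder_states = torch.randn(batch, decoder_steps, 256)
frames = output_layer(decoder_states)                       # (4, 50, 160)
frames = frames.reshape(batch, decoder_steps * r, mel_dim)  # (4, 100, 80)
print(frames.shape)
```

    With r = 2 the decoder runs for half as many steps as there are target frames, which is the source of the savings in model size, training time, and inference time claimed above.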
    Conclusion
    • The authors trained a model without the post-processing net while keeping all the other components untouched.
    • The prediction from the post-processing net contains better resolved harmonics and high frequency formant structure, which reduces synthesis artifacts.
    • The authors have proposed Tacotron, an integrated end-to-end generative TTS model that takes a character sequence as input and outputs the corresponding spectrogram.
    • Attention module, loss function, and Griffin-Lim-based waveform synthesizer are all ripe for improvement.
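    Since the conclusion singles out the Griffin-Lim-based waveform synthesizer, a minimal sketch of that final stage may help: Griffin-Lim inverts a linear-frequency magnitude spectrogram to a waveform by iteratively estimating phase (Griffin & Lim, 1984). The sketch below uses librosa; the STFT parameters, iteration count, and the synthetic input signal are illustrative assumptions, not the paper's settings.

```python
import numpy as np
import librosa

# Illustrative STFT parameters; the paper's actual analysis settings differ.
sr = 24000
n_fft, hop_length, win_length = 2048, 300, 1200

# Stand-in for the post-net's predicted linear-scale log magnitude spectrogram:
# here we analyze a synthetic tone so the example runs end to end offline.
y = librosa.tone(440, sr=sr, duration=1.0)
log_mag = np.log(np.abs(librosa.stft(y, n_fft=n_fft, hop_length=hop_length,
                                     win_length=win_length)) + 1e-6)

# Undo the log, then run Griffin-Lim to iteratively recover phase.
magnitude = np.exp(log_mag)
waveform = librosa.griffinlim(magnitude, n_iter=60,
                              hop_length=hop_length, win_length=win_length)
print(waveform.shape)
```

    In Tacotron, this magnitude spectrogram would come from the post-processing net rather than from an analysis of real audio.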
    Tables
    • Table 1: Hyper-parameters and network architectures. “conv-k-c-ReLU” denotes a 1-D convolution with filter width k and c output channels, followed by a ReLU activation. FC stands for fully-connected.
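    To illustrate the “conv-k-c-ReLU” notation and the K=8, conv-k-128-ReLU bank referenced in the Results section, here is a minimal PyTorch sketch of a 1-D convolution bank in the style of the CBHG module: K parallel convolutions with filter widths k = 1 … K, each with c ReLU-activated output channels, concatenated along the channel axis. The input channel count is an illustrative assumption.

```python
import torch
import torch.nn as nn

class Conv1dBank(nn.Module):
    """Bank of K 1-D convolutions, one per filter width k = 1..K.

    Each branch is conv-k-c followed by ReLU ("conv-k-c-ReLU"); branch
    outputs are concatenated along the channel dimension. The input
    channel count here is an illustrative assumption.
    """
    def __init__(self, in_channels=128, K=8, c=128):
        super().__init__()
        self.convs = nn.ModuleList(
            nn.Conv1d(in_channels, c, kernel_size=k, padding="same")
            for k in range(1, K + 1)
        )

    def forward(self, x):  # x: (batch, in_channels, time)
        return torch.cat([torch.relu(conv(x)) for conv in self.convs], dim=1)

bank = Conv1dBank()
y = bank(torch.randn(2, 128, 100))
print(y.shape)  # torch.Size([2, 1024, 100]) -- K * c = 8 * 128 channels
```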
    Related work
    • WaveNet (van den Oord et al, 2016) is a powerful generative model of audio. It works well for TTS, but is slow due to its sample-level autoregressive nature. It also requires conditioning on linguistic features from an existing TTS frontend, and thus is not end-to-end: it only replaces the vocoder and acoustic model. Another recently-developed neural model is DeepVoice (Arik et al, 2017), which replaces every component in a typical TTS pipeline by a corresponding neural network. However, each component is independently trained, and it’s nontrivial to change the system to train in an end-to-end fashion.

      To our knowledge, Wang et al (2016) is the earliest work touching end-to-end TTS using seq2seq with attention. However, it has several limitations. First, it requires a pre-trained hidden Markov model (HMM) aligner to help the seq2seq model learn the alignment, so it is hard to tell how much alignment is learned by the seq2seq model per se. Second, a few tricks are used to get the model trained, which the authors note hurt prosody. Third, it predicts vocoder parameters and hence needs a vocoder. Furthermore, the model is trained on phoneme inputs and the experimental results seem to be somewhat limited.
    References
    • Martín Abadi, Ashish Agarwal, Paul Barham, Eugene Brevdo, Zhifeng Chen, Craig Citro, Greg S. Corrado, Andy Davis, Jeffrey Dean, Matthieu Devin, et al. TensorFlow: Large-scale machine learning on heterogeneous distributed systems. arXiv preprint arXiv:1603.04467, 2016.
    • Yannis Agiomyrgiannakis. Vocaine the vocoder and applications in speech synthesis. In Acoustics, Speech and Signal Processing (ICASSP), 2015 IEEE International Conference on, pp. 4230–4234. IEEE, 2015.
    • Sercan Arik, Mike Chrzanowski, Adam Coates, Gregory Diamos, Andrew Gibiansky, Yongguo Kang, Xian Li, John Miller, Jonathan Raiman, Shubho Sengupta, and Mohammad Shoeybi. Deep Voice: Real-time neural text-to-speech. arXiv preprint arXiv:1702.07825, 2017.
    • Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473, 2014.
    • Samy Bengio, Oriol Vinyals, Navdeep Jaitly, and Noam Shazeer. Scheduled sampling for sequence prediction with recurrent neural networks. In Advances in Neural Information Processing Systems, pp. 1171–1179, 2015.
    • William Chan, Navdeep Jaitly, Quoc Le, and Oriol Vinyals. Listen, attend and spell: A neural network for large vocabulary conversational speech recognition. In Acoustics, Speech and Signal Processing (ICASSP), 2016 IEEE International Conference on, pp. 4960–4964. IEEE, 2016.
    • Junyoung Chung, Caglar Gulcehre, KyungHyun Cho, and Yoshua Bengio. Empirical evaluation of gated recurrent neural networks on sequence modeling. arXiv preprint arXiv:1412.3555, 2014.
    • Xavi Gonzalvo, Siamak Tazari, Chun-an Chan, Markus Becker, Alexander Gutkin, and Hanna Silen. Recent advances in Google real-time HMM-driven unit selection synthesizer. In Proceedings of Interspeech, pp. 2238–2242, 2016.
    • Daniel Griffin and Jae Lim. Signal estimation from modified short-time Fourier transform. IEEE Transactions on Acoustics, Speech, and Signal Processing, 32(2):236–243, 1984.
    • Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778, 2016.
    • Sergey Ioffe and Christian Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. arXiv preprint arXiv:1502.03167, 2015.
    • Diederik Kingma and Jimmy Ba. Adam: A method for stochastic optimization. In Proceedings of the 3rd International Conference on Learning Representations (ICLR), 2015.
    • Jason Lee, Kyunghyun Cho, and Thomas Hofmann. Fully character-level neural machine translation without explicit segmentation. arXiv preprint arXiv:1610.03017, 2016.
    • Soroush Mehri, Kundan Kumar, Ishaan Gulrajani, Rithesh Kumar, Shubham Jain, Jose Sotelo, Aaron Courville, and Yoshua Bengio. SampleRNN: An unconditional end-to-end neural audio generation model. arXiv preprint arXiv:1612.07837, 2016.
    • Jose Sotelo, Soroush Mehri, Kundan Kumar, João Felipe Santos, Kyle Kastner, Aaron Courville, and Yoshua Bengio. Char2Wav: End-to-end speech synthesis. ICLR 2017 workshop submission, 2017.
    • Richard Sproat and Navdeep Jaitly. RNN approaches to text normalization: A challenge. arXiv preprint arXiv:1611.00068, 2016.
    • Rupesh Kumar Srivastava, Klaus Greff, and Jürgen Schmidhuber. Highway networks. arXiv preprint arXiv:1505.00387, 2015.
    • Ilya Sutskever, Oriol Vinyals, and Quoc V. Le. Sequence to sequence learning with neural networks. In Advances in Neural Information Processing Systems, pp. 3104–3112, 2014.
    • Paul Taylor. Text-to-Speech Synthesis. Cambridge University Press, 2009.
    • Lucas Theis, Aaron van den Oord, and Matthias Bethge. A note on the evaluation of generative models. arXiv preprint arXiv:1511.01844, 2015.
    • Aaron van den Oord, Sander Dieleman, Heiga Zen, Karen Simonyan, Oriol Vinyals, Alex Graves, Nal Kalchbrenner, Andrew Senior, and Koray Kavukcuoglu. WaveNet: A generative model for raw audio. arXiv preprint arXiv:1609.03499, 2016.
    • Oriol Vinyals, Łukasz Kaiser, Terry Koo, Slav Petrov, Ilya Sutskever, and Geoffrey Hinton. Grammar as a foreign language. In Advances in Neural Information Processing Systems, pp. 2773–2781, 2015.
    • Wenfu Wang, Shuang Xu, and Bo Xu. First step towards end-to-end parametric TTS synthesis: Generating spectral parameters with neural attention. In Proceedings of Interspeech, pp. 2243–2247, 2016.
    • Yonghui Wu, Mike Schuster, Zhifeng Chen, Quoc V. Le, Mohammad Norouzi, Wolfgang Macherey, Maxim Krikun, Yuan Cao, Qin Gao, Klaus Macherey, et al. Google's neural machine translation system: Bridging the gap between human and machine translation. arXiv preprint arXiv:1609.08144, 2016.
    • Heiga Zen, Keiichi Tokuda, and Alan W. Black. Statistical parametric speech synthesis. Speech Communication, 51(11):1039–1064, 2009.
    • Heiga Zen, Yannis Agiomyrgiannakis, Niels Egberts, Fergus Henderson, and Przemysław Szczepaniak. Fast, compact, and high quality LSTM-RNN based statistical parametric speech synthesizers for mobile devices. In Proceedings of Interspeech, 2016.