High Fidelity Speech Synthesis with Adversarial Networks

Mikołaj Bińkowski
Aidan Clark
Norman Casagrande
Luis C. Cobo

ICLR, 2020.


Abstract:

Generative adversarial networks have seen rapid development in recent years and have led to remarkable improvements in generative modelling of images. However, their application in the audio domain has received limited attention, and autoregressive models, such as WaveNet, remain the state of the art in generative modelling of audio...
Introduction
  • The Text-to-Speech (TTS) task consists in the conversion of text into speech audio. In recent years, the TTS field has seen remarkable progress, sparked by the development of neural autoregressive models for raw audio waveforms such as WaveNet (van den Oord et al, 2016), SampleRNN (Mehri et al, 2017) and WaveRNN (Kalchbrenner et al, 2018).
  • A lot of recent research on neural models for TTS has focused on improving parallelism by predicting multiple time steps in parallel, e.g. using flow-based models (van den Oord et al, 2018; Ping et al, 2019; Prenger et al, 2019; Kim et al, 2019)
  • Such highly parallelisable models are better suited to running efficiently on modern hardware.
  • GANs currently constitute one of the dominant paradigms for generative modelling of images, and they are able to produce high-fidelity samples that are almost indistinguishable from real data
  • Their application to audio generation tasks has seen relatively limited success so far.
Highlights
  • The Text-to-Speech (TTS) task consists in the conversion of text into speech audio
  • We propose a family of quantitative metrics for speech generation based on Frechet Inception Distance (FID, Heusel et al, 2017) and Kernel Inception Distance (KID, Binkowski et al, 2018), where we replace the Inception image recognition network with the DeepSpeech audio recognition network (a computational sketch of these distances follows this list)
  • We provide subjective human evaluation of our model using Mean Opinion Scores (MOS), as well as quantitative metrics
  • Our architectural exploration led to the development of a model with an ensemble of unconditional and conditional Random Window Discriminators operating at different window sizes, which respectively assess the realism of the generated speech and its correspondence with the input text
  • We have proposed a family of quantitative metrics for generative models of speech: Frechet DeepSpeech Distance and Kernel DeepSpeech Distance, and demonstrated experimentally that these metrics rank models in line with Mean Opinion Scores obtained through human evaluation
  • The metrics are publicly available to the machine learning community, as is the DeepSpeech recognition model they are based on
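
To make these metrics concrete, below is a minimal sketch of how Frechet- and kernel-based distances can be computed once real and generated clips have been mapped to feature vectors by an audio recognition network such as DeepSpeech. The function names and the choice of a polynomial kernel (as in KID) are illustrative; the paper's exact feature-extraction layer and bias-correction details are not shown.

```python
import numpy as np
from scipy import linalg

def frechet_distance(feats_real, feats_gen):
    """Frechet distance between Gaussians fitted to two sets of feature vectors.

    feats_real, feats_gen: arrays of shape (num_clips, feature_dim), e.g.
    activations of a DeepSpeech-style recognition network for real and
    generated audio clips.
    """
    mu_r, mu_g = feats_real.mean(axis=0), feats_gen.mean(axis=0)
    cov_r = np.cov(feats_real, rowvar=False)
    cov_g = np.cov(feats_gen, rowvar=False)
    # Matrix square root of the covariance product; keep the real part to
    # discard tiny imaginary components caused by numerical error.
    covmean = linalg.sqrtm(cov_r @ cov_g).real
    diff = mu_r - mu_g
    return float(diff @ diff + np.trace(cov_r + cov_g - 2.0 * covmean))

def kernel_distance(feats_real, feats_gen, degree=3, coef0=1.0):
    """Unbiased MMD^2 estimate with the polynomial kernel used by KID."""
    d = feats_real.shape[1]
    kernel = lambda x, y: (x @ y.T / d + coef0) ** degree
    k_rr = kernel(feats_real, feats_real)
    k_gg = kernel(feats_gen, feats_gen)
    k_rg = kernel(feats_real, feats_gen)
    m, n = len(feats_real), len(feats_gen)
    return ((k_rr.sum() - np.trace(k_rr)) / (m * (m - 1))
            + (k_gg.sum() - np.trace(k_gg)) / (n * (n - 1))
            - 2.0 * k_rg.mean())
```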
Methods
  • The authors discuss the experiments, comparing GAN-TTS with WaveNet and carrying out ablations that validate the architectural choices.

    As mentioned in Section 3, the main architectural choices made in the model include the use of multiple RWDs, conditional and unconditional, with a number of different downsampling factors.
  • Single conditional RWD: cRWD1.
  • Multiple conditional RWDs: cRWD{1,2,4,8,15}, i.e. the ensemble of cRWDk for k ∈ {1,2,4,8,15}.
  • Single conditional and single unconditional RWD: cRWD1 + uRWD1.
  • 10 RWDs without downsampling but with different window sizes: RWD{1, 240×k} for k ∈ {1,2,4,8,15} (window size 240×k samples, no downsampling), with a conditional and an unconditional discriminator at each window size; a minimal sketch of such an RWD ensemble follows this list.
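
For concreteness, the following is a minimal sketch of a random window discriminator ensemble under the assumptions above: each RWDk crops a random window from the waveform, folds k consecutive samples into channels (the downsampling), and scores the result, and the ensemble collects the per-member scores. Class names, layer sizes and the plain convolutional body are illustrative stand-ins for the paper's DBlocks; the conditional variants, which also receive aligned linguistic features, are omitted.

```python
import torch
import torch.nn as nn

class RandomWindowDiscriminator(nn.Module):
    """One unconditional RWD_k: crop a random window, fold `factor` (= k)
    consecutive samples into channels, then score the window."""

    def __init__(self, factor, downsampled_window=240):
        super().__init__()
        self.factor = factor
        self.window_size = downsampled_window * factor   # window length in raw samples
        self.body = nn.Sequential(                        # stand-in for the paper's DBlocks
            nn.Conv1d(factor, 64, kernel_size=5, padding=2),
            nn.ReLU(),
            nn.Conv1d(64, 1, kernel_size=3, padding=1),
        )

    def forward(self, wave):                              # wave: (batch, 1, num_samples)
        start = torch.randint(0, wave.shape[-1] - self.window_size + 1, (1,)).item()
        window = wave[..., start:start + self.window_size]
        # (batch, 1, 240 * k) -> (batch, k, 240): k consecutive samples become channels.
        window = window.reshape(wave.shape[0], -1, self.factor).transpose(1, 2)
        return self.body(window).mean(dim=(1, 2))         # one scalar score per clip

class RWDEnsemble(nn.Module):
    """Ensemble over downsampling factors; losses are summed over members."""

    def __init__(self, factors=(1, 2, 4, 8, 15)):
        super().__init__()
        self.members = nn.ModuleList(RandomWindowDiscriminator(k) for k in factors)

    def forward(self, wave):
        return [d(wave) for d in self.members]            # list of per-member scores
```

During training, each member's score would enter a standard GAN objective, with the per-member losses summed over the ensemble; the full GAN-TTS ensemble combines such unconditional members with conditional ones.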
Results
  • The authors provide subjective human evaluation of the model using Mean Opinion Scores (MOS), as well as quantitative metrics.

    The authors evaluate the model on a set of 1000 sentences, using human evaluators.
  • Each evaluator was asked to mark the subjective naturalness of a sentence on a 1-5 Likert scale; the resulting scores are compared to those reported by van den Oord et al (2018) for WaveNet and Parallel WaveNet. Although the model was trained to generate 2-second audio clips with a starting point not necessarily aligned with the beginning of a sentence, the authors are able to generate samples of arbitrary length.
  • Human evaluators scored full sentences with a length of up to 15 seconds (a sketch of how such ratings are aggregated into a MOS follows below).
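
As a point of reference, here is a minimal sketch of how per-sentence Likert ratings are typically aggregated into a MOS with a 95% confidence interval; it illustrates the metric, not necessarily the exact evaluation protocol behind Table 1.

```python
import numpy as np

def mean_opinion_score(ratings):
    """Aggregate 1-5 Likert naturalness ratings into a MOS and a 95% CI half-width."""
    ratings = np.asarray(ratings, dtype=float)
    mos = ratings.mean()
    # Normal approximation; adequate for the large rating counts collected in TTS studies.
    half_width = 1.96 * ratings.std(ddof=1) / np.sqrt(ratings.size)
    return mos, half_width
```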
Conclusion
  • Unlike state-of-the-art text-to-speech models, GAN-TTS is adversarially trained and the resulting generator is a feed-forward convolutional network (a minimal sketch of such a generator follows below).
  • This allows for very efficient audio generation, which is important in practical applications.
  • The authors' architectural exploration led to the development of a model with an ensemble of unconditional and conditional Random Window Discriminators operating at different window sizes, which respectively assess the realism of the generated speech and its correspondence with the input text.
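
Since GAN-TTS itself is not reproduced here, the following is a minimal sketch of what such a feed-forward convolutional generator looks like: linguistic features at 200 Hz are upsampled by a total factor of 120 to a 24 kHz waveform. The feature dimensionality, channel widths and the plain upsample-and-convolve blocks are illustrative placeholders for the paper's GBlocks.

```python
import torch
import torch.nn as nn

class FeedForwardVocoder(nn.Module):
    """Illustrative feed-forward generator: 200 Hz conditioning frames -> 24 kHz audio."""

    def __init__(self, feature_dim=128, channels=256, upsample_factors=(2, 2, 2, 3, 5)):
        super().__init__()
        layers = [nn.Conv1d(feature_dim, channels, kernel_size=3, padding=1)]
        for f in upsample_factors:                        # total upsampling: 2*2*2*3*5 = 120
            layers += [
                nn.ReLU(),
                nn.Upsample(scale_factor=f, mode="nearest"),
                nn.Conv1d(channels, channels, kernel_size=3, padding=1),
            ]
        layers += [nn.ReLU(), nn.Conv1d(channels, 1, kernel_size=3, padding=1), nn.Tanh()]
        self.net = nn.Sequential(*layers)

    def forward(self, features):                           # features: (batch, feature_dim, frames)
        return self.net(features)                          # (batch, 1, frames * 120)

# Every output sample depends on a fixed, finite receptive field, so whole
# utterances can be synthesised in a single parallel forward pass.
wave = FeedForwardVocoder()(torch.randn(1, 128, 400))      # 2 s of features -> 48,000 samples
```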
Tables
  • Table1: Results from prior work, the ablation study and the proposed model. Mean opinion scores for natural speech, WaveNet and Parallel WaveNet are taken from van den Oord et al (2018) and are not directly comparable due to dataset differences. For natural speech we present estimated FDSD (non-zero due to the bias of the estimator) and theoretical values of KDSD and cKDSD. cFDSD is unavailable; see Appendix B.2
  • Table2: Architecture of GAN-TTS’s Generator. t denotes the temporal dimension, while ch denotes the number of channels. The rightmost three columns describe dimensions of the output of the corresponding layer
  • Table3: Downsample factors in discriminators for different initial stride values k
Related work
  • 2.1 AUDIO GENERATION

    Most neural models for audio generation are likelihood-based: they represent an explicit probability distribution and the likelihood of the observed data is maximised under this distribution. Autoregressive models achieve this by factorising the joint distribution into a product of conditional distributions (van den Oord et al, 2016; Mehri et al, 2017; Kalchbrenner et al, 2018; Arik et al, 2017). Another strategy is to use an invertible feed-forward neural network to model the joint density directly (Prenger et al, 2019; Kim et al, 2019). Alternatively, an invertible feed-forward model can be trained by distilling an autoregressive model using probability density distillation (van den Oord et al, 2018; Ping et al, 2019), which enables it to focus on particular modes. Such mode-seeking behaviour is often desirable in conditional generation settings: we want the generated speech signals to sound realistic and correspond to the given text, but we are not interested in modelling every possible variation that occurs in the data. This reduces model capacity requirements, because parts of the data distribution may be ignored. Note that adversarial models exhibit similar behaviour, but without the distillation and invertibility requirements.
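
To make the distinction concrete, here is a minimal sketch of the autoregressive factorisation that likelihood-based models such as WaveNet maximise, log p(x) = sum_t log p(x_t | x_<t). The `model(prefix)` interface, assumed to return a distribution over the next quantised sample, is purely illustrative; real implementations evaluate all conditionals in a single pass with causal convolutions rather than in a Python loop.

```python
import torch

def autoregressive_log_likelihood(model, x):
    """log p(x) = sum_t log p(x_t | x_<t) for a batch of quantised sample sequences x.

    `model(prefix)` is assumed to return a torch.distributions object over the
    next sample given the prefix; the interface is illustrative.
    """
    log_prob = torch.zeros(x.shape[0])
    for t in range(x.shape[-1]):
        next_sample = model(x[..., :t])            # p(x_t | x_<t)
        log_prob = log_prob + next_sample.log_prob(x[..., t])
    return log_prob
```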
Contributions
  • Introduces GAN-TTS, a Generative Adversarial Network for Text-to-Speech
  • Employs both subjective human evaluation and novel quantitative metrics, which are found to be well correlated with MOS
  • Shows that GAN-TTS is capable of generating high-fidelity speech with naturalness comparable to state-of-the-art models, and that, unlike autoregressive models, it is highly parallelisable thanks to an efficient feed-forward generator
  • Explores raw waveform generation with GANs, and demonstrates that adversarially trained feed-forward generators are able to synthesise high-fidelity speech audio
  • Introduces GAN-TTS, a Generative Adversarial Network for text-conditional high-fidelity speech synthesis
References
  • Dario Amodei, Rishita Anubhai, Eric Battenberg, Carl Case, Jared Casper, Bryan Catanzaro, Jingdong Chen, Mike Chrzanowski, Adam Coates, Greg Diamos, Erich Elsen, Jesse Engel, Linxi Fan, Christopher Fougner, Tony Han, Awni Hannun, Billy Jun, Patrick LeGresley, Libby Lin, Sharan Narang, Andrew Ng, Sherjil Ozair, Ryan Prenger, Jonathan Raiman, Sanjeev Satheesh, David Seetapun, Shubho Sengupta, Yi Wang, Zhiqian Wang, Chong Wang, Bo Xiao, Dani Yogatama, Jun Zhan, and Zhenyao Zhu. Deep Speech 2: End-to-end speech recognition in English and Mandarin. In ICML, 2016.
  • Sercan O Arik, Mike Chrzanowski, Adam Coates, Gregory Diamos, Andrew Gibiansky, Yongguo Kang, Xian Li, John Miller, Andrew Ng, Jonathan Raiman, et al. Deep Voice: Real-time neural text-to-speech. In ICML, 2017.
  • Mikołaj Binkowski, Dougal J Sutherland, Michael Arbel, and Arthur Gretton. Demystifying MMD GANs. In ICLR, 2018.
  • Andrew Brock, Theodore Lim, James M. Ritchie, and Nick Weston. Neural photo editing with introspective adversarial networks. In ICLR, 2016.
  • Andrew Brock, Jeff Donahue, and Karen Simonyan. Large scale GAN training for high fidelity natural image synthesis. In ICLR, 2019.
  • Aidan Clark, Jeff Donahue, and Karen Simonyan. Efficient video generation on complex datasets. arXiv:1907.06571, 2019.
  • Emily L Denton, Soumith Chintala, Arthur Szlam, and Rob Fergus. Deep generative image models using a Laplacian pyramid of adversarial networks. In NeurIPS, 2015.
  • Chris Donahue, Julian McAuley, and Miller Puckette. Adversarial audio synthesis. In ICLR, 2019.
  • Jeff Donahue and Karen Simonyan. Large scale adversarial representation learning. arXiv:1907.02544, 2019.
  • Jeff Donahue, Philipp Krahenbuhl, and Trevor Darrell. Adversarial feature learning. In ICLR, 2017.
  • Vincent Dumoulin, Ishmael Belghazi, Ben Poole, Olivier Mastropietro, Alex Lamb, Martin Arjovsky, and Aaron Courville. Adversarially learned inference. In ICLR, 2017a.
  • Vincent Dumoulin, Jonathon Shlens, and Manjunath Kudlur. A learned representation for artistic style. In ICLR, 2017b.
  • Jesse Engel, Cinjon Resnick, Adam Roberts, Sander Dieleman, Mohammad Norouzi, Douglas Eck, and Karen Simonyan. Neural audio synthesis of musical notes with WaveNet autoencoders. In ICML, 2017.
  • Jesse Engel, Kumar Krishna Agrawal, Shuo Chen, Ishaan Gulrajani, Chris Donahue, and Adam Roberts. GANSynth: Adversarial neural audio synthesis. In ICLR, 2019.
  • Jort F. Gemmeke, Daniel P. W. Ellis, Dylan Freedman, Aren Jansen, Wade Lawrence, R. Channing Moore, Manoj Plakal, and Marvin Ritter. Audio set: An ontology and human-labeled dataset for audio events. In Proc. IEEE ICASSP 2017, New Orleans, LA, 2017.
  • Andrew Gibiansky, Sercan Arik, Gregory Diamos, John Miller, Kainan Peng, Wei Ping, Jonathan Raiman, and Yanqi Zhou. Deep Voice 2: Multi-speaker neural text-to-speech. In NeurIPS, 2017.
  • Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial nets. In NeurIPS, 2014.
  • A. Gretton, K. Borgwardt, M. Rasch, B. Scholkopf, and A. Smola. A kernel two-sample test. JMLR, 2012.
  • Daniel Griffin and Jae Lim. Signal estimation from modified short-time Fourier transform. IEEE Transactions on Acoustics, Speech, and Signal Processing, 1984.
  • Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Identity mappings in deep residual networks. In ECCV, 2016.
  • Martin Heusel, Hubert Ramsauer, Thomas Unterthiner, Bernhard Nessler, and Sepp Hochreiter. GANs trained by a two time-scale update rule converge to a local Nash equilibrium. In NeurIPS, 2017.
  • Xun Huang, Ming-Yu Liu, Serge Belongie, and Jan Kautz. Multimodal unsupervised image-toimage translation. In ECCV, 2018.
  • Sergey Ioffe and Christian Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. In ICML, 2015.
  • Phillip Isola, Jun-Yan Zhu, Tinghui Zhou, and Alexei A Efros. Image-to-image translation with conditional adversarial networks. In CVPR, 2017.
  • Nal Kalchbrenner, Erich Elsen, Karen Simonyan, Seb Noury, Norman Casagrande, Edward Lockhart, Florian Stimberg, Aaron van den Oord, Sander Dieleman, and Koray Kavukcuoglu. Efficient neural audio synthesis. In ICML, 2018.
  • Tero Karras, Timo Aila, Samuli Laine, and Jaakko Lehtinen. Progressive growing of GANs for improved quality, stability, and variation. In ICLR, 2018.
  • Tero Karras, Samuli Laine, and Timo Aila. A style-based generator architecture for generative adversarial networks. In CVPR, 2019.
  • Kevin Kilgour, Mauricio Zuluaga, Dominik Roblek, and Matthew Sharifi. Frechet audio distance: A metric for evaluating music enhancement algorithms. In Interspeech, 2019.
  • Sungwon Kim, Sang-Gil Lee, Jongyoon Song, Jaehyeon Kim, and Sungroh Yoon. FloWaveNet: A generative flow for raw audio. In ICML, 2019.
  • Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. In ICLR, 2015.
  • Oleksii Kuchaiev, Boris Ginsburg, Igor Gitman, Vitaly Lavrukhin, Carl Case, and Paulius Micikevicius. OpenSeq2Seq: Extensible toolkit for distributed and mixed precision training of sequenceto-sequence models. In NLP-OSS, 2018.
  • Kundan Kumar, Rithesh Kumar, Thibault de Boissiere, Lucas Gestin, Wei Zhen Teoh, Jose Sotelo, Alexandre de Brebisson, Yoshua Bengio, and Aaron Courville. MelGAN: Generative adversarial networks for conditional waveform synthesis. In NeurIPS, 2019.
  • Jonathan Le Roux, Hirokazu Kameoka, Nobutaka Ono, and Shigeki Sagayama. Fast signal reconstruction from magnitude STFT spectrogram based on spectrogram consistency. In DAFx, 2010.
  • Chuan Li and Michael Wand. Precomputed real-time texture synthesis with Markovian generative adversarial networks. In ECCV, 2016.
  • Jae Hyun Lim and Jong Chul Ye. Geometric GAN. arXiv:1705.02894, 2017.
  • Soroush Mehri, Kundan Kumar, Ishaan Gulrajani, Rithesh Kumar, Shubham Jain, Jose Sotelo, Aaron Courville, and Yoshua Bengio. SampleRNN: An unconditional end-to-end neural audio generation model. In ICLR, 2017.
  • Takeru Miyato, Toshiki Kataoka, Masanori Koyama, and Yuichi Yoshida. Spectral normalization for generative adversarial networks. In ICLR, 2018.
  • Paarth Neekhara, Chris Donahue, Miller Puckette, Shlomo Dubnov, and Julian McAuley. Expediting TTS synthesis with adversarial vocoding. In Interspeech, 2019.
  • Wei Ping, Kainan Peng, Andrew Gibiansky, Sercan O. Arik, Ajay Kannan, Sharan Narang, Jonathan Raiman, and John Miller. Deep Voice 3: 2000-speaker neural text-to-speech. In ICLR, 2018.
  • Wei Ping, Kainan Peng, and Jitong Chen. ClariNet: Parallel wave generation in end-to-end text-tospeech. In ICLR, 2019.
  • Ryan Prenger, Rafael Valle, and Bryan Catanzaro. WaveGlow: A flow-based generative network for speech synthesis. In ICASSP, 2019.
  • Masaki Saito and Shunta Saito. TGANv2: Efficient training of large models for video generation with multiple subsampling layers. arXiv:1811.09245, 2018.
  • Y. Saito, S. Takamichi, and H. Saruwatari. Statistical parametric speech synthesis incorporating generative adversarial networks. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 26(1):84–96, Jan 2018.
  • Andrew Saxe, James McClelland, and Surya Ganguli. Exact solutions to the nonlinear dynamics of learning in deep linear neural networks. In ICLR, 2014.
  • Jonathan Shen, Ruoming Pang, Ron J Weiss, Mike Schuster, Navdeep Jaitly, Zongheng Yang, Zhifeng Chen, Yu Zhang, Yuxuan Wang, Rj Skerrv-Ryan, et al. Natural TTS synthesis by conditioning WaveNet on Mel spectrogram predictions. In ICASSP, 2018.
  • Jose Sotelo, Soroush Mehri, Kundan Kumar, Joao Felipe Santos, Kyle Kastner, Aaron Courville, and Yoshua Bengio. Char2Wav: End-to-end speech synthesis. In ICLR, 2017.
  • Christian Szegedy, Vincent Vanhoucke, Sergey Ioffe, Jon Shlens, and Zbigniew Wojna. Rethinking the Inception architecture for computer vision. In CVPR, 2016.
  • Kou Tanaka, Takuhiro Kaneko, Nobukatsu Hojo, and Hirokazu Kameoka. Synthetic-to-natural speech waveform conversion using cycle-consistent adversarial networks. In 2018 IEEE Spoken Language Technology Workshop (SLT), pp. 632–639. IEEE, 2018.
  • Aaron van den Oord, Sander Dieleman, Heiga Zen, Karen Simonyan, Oriol Vinyals, Alex Graves, Nal Kalchbrenner, Andrew W. Senior, and Koray Kavukcuoglu. WaveNet: A generative model for raw audio. arXiv:1609.03499, 2016.
  • Aaron van den Oord, Yazhe Li, Igor Babuschkin, Karen Simonyan, Oriol Vinyals, Koray Kavukcuoglu, George Driessche, Edward Lockhart, Luis Cobo, Florian Stimberg, Norman Casagrande, Dominik Grewe, Seb Noury, Sander Dieleman, Erich Elsen, Nal Kalchbrenner, Heiga Zen, Alex Graves, Helen King, Tom Walters, Dan Belov, and Demis Hassabis. Parallel WaveNet: Fast high-fidelity speech synthesis. In ICML, 2018.
  • Sean Vasquez and Mike Lewis. MelNet: A generative model for audio in the frequency domain. arXiv:1906.01083, 2019.
  • Yuxuan Wang, RJ Skerry-Ryan, Daisy Stanton, Yonghui Wu, Ron J Weiss, Navdeep Jaitly, Zongheng Yang, Ying Xiao, Zhifeng Chen, Samy Bengio, et al. Tacotron: Towards end-to-end speech synthesis. In Interspeech, 2017.
  • Ryuichi Yamamoto, Eunwoo Song, and Jae-Min Kim. Probability density distillation with generative adversarial networks for high-quality parallel waveform generation. arXiv preprint arXiv:1904.04472, 2019.
  • Fisher Yu and Vladlen Koltun. Multi-scale context aggregation by dilated convolutions. In ICLR, 2016.
  • Han Zhang, Tao Xu, Hongsheng Li, Shaoting Zhang, Xiaogang Wang, Xiaolei Huang, and Dimitris N Metaxas. StackGAN: Text to photo-realistic image synthesis with stacked generative adversarial networks. In ICCV, 2017.
  • Han Zhang, Ian Goodfellow, Dimitris Metaxas, and Augustus Odena. Self-attention generative adversarial networks. In ICML, 2019.
  • Jun-Yan Zhu, Taesung Park, Phillip Isola, and Alexei A Efros. Unpaired image-to-image translation using cycle-consistent adversarial networks. In ICCV, 2017.