vq-wav2vec: Self-Supervised Learning of Discrete Speech Representations

ICLR, 2020.

Keywords:
speech recognition, speech representation learning

Abstract:

We propose vq-wav2vec to learn discrete representations of audio segments through a wav2vec-style self-supervised context prediction task. The algorithm uses either a gumbel softmax or online k-means clustering to quantize the dense representations. Discretization enables the direct application of algorithms from the NLP community which require discrete inputs.
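The abstract names two quantizers. The sketch below is a rough, hedged illustration of the online k-means variant only (not the authors' implementation): each dense frame representation is replaced by its nearest codeword, gradients are copied straight through, and a VQ-VAE-style auxiliary loss with a commitment term weighted by γ is added (the highlights below quote γ = 0.25 as a robust choice). The codebook size, feature dimension, and class name are illustrative assumptions.

```python
import torch
import torch.nn as nn


class OnlineKMeansQuantizer(nn.Module):
    """Illustrative VQ-VAE-style quantizer; a sketch, not the paper's exact module."""

    def __init__(self, num_codes: int = 320, dim: int = 512, gamma: float = 0.25):
        super().__init__()
        self.codebook = nn.Parameter(torch.randn(num_codes, dim))
        self.gamma = gamma  # weight on the commitment term

    def forward(self, z):
        # z: (batch, time, dim) dense encoder outputs
        dists = torch.cdist(z, self.codebook.unsqueeze(0).expand(z.size(0), -1, -1))
        idx = dists.argmin(dim=-1)          # (batch, time) discrete code indices
        q = self.codebook[idx]              # (batch, time, dim) quantized vectors
        # straight-through estimator: forward pass uses q, gradients flow back to z
        q_st = z + (q - z).detach()
        # auxiliary loss: pull codewords toward encoder outputs, plus commitment term
        vq_loss = ((q - z.detach()) ** 2).mean() + self.gamma * ((z - q.detach()) ** 2).mean()
        return q_st, idx, vq_loss
```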

Introduction
  • Learning discrete representations of speech has gathered much recent interest (Versteegh et al, 2016; Dunbar et al, 2019).
  • The authors train a Deep Bidirectional Transformer (BERT; Devlin et al, 2018; Liu et al, 2019) on the discretized unlabeled speech data and input these representations to a standard acoustic model (Figure 1b; §4).
  • The representations c_i produced by the context network are input to the acoustic model instead of log-mel filterbank features.
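As a minimal sketch of the feature swap described above, the toy acoustic model below consumes (batch, feature dim, frames) tensors, so switching from log-mel filterbanks to the learned context/BERT representations only changes the input dimension; the 80/768 dimensions, layer sizes, and 29 letter targets are illustrative assumptions, not the paper's configuration.

```python
import torch
import torch.nn as nn


def make_acoustic_model(input_dim: int, num_targets: int = 29) -> nn.Module:
    """Toy wav2letter-style convolutional acoustic model over (batch, dim, frames) inputs."""
    return nn.Sequential(
        nn.Conv1d(input_dim, 256, kernel_size=7, padding=3),
        nn.ReLU(),
        nn.Conv1d(256, 256, kernel_size=7, padding=3),
        nn.ReLU(),
        nn.Conv1d(256, num_targets, kernel_size=1),  # per-frame letter logits
    )


fbank_model = make_acoustic_model(input_dim=80)      # baseline: log-mel filterbanks
context_model = make_acoustic_model(input_dim=768)   # learned c_i / BERT representations

features = torch.randn(4, 768, 100)                  # dummy (batch, dim, frames) input
logits = context_model(features)                     # (4, 29, 100) frame-level logits
```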
Highlights
  • Learning discrete representations of speech has gathered much recent interest (Versteegh et al, 2016; Dunbar et al, 2019)
  • We evaluate models on two benchmarks: TIMIT (Garofolo et al, 1993b) is a 5h dataset with phoneme labels and Wall Street Journal (WSJ; Garofolo et al, 1993a) is an 81h dataset for speech recognition
  • Following van den Oord et al (2017), we found γ = 0.25 to be a robust choice for balancing the vector quantization (VQ) auxiliary loss
  • We train models with different numbers of groups G and variables V to vary the possible codebook size V^G and measure accuracy on TIMIT phoneme recognition without BERT training (see the quantizer sketch after this list)
  • We consider various lossy compression algorithms applied to the TIMIT audio data and train wav2letter models on the resulting audio: Codec2 as a low-bitrate codec, Opus (Terriberry & Vos, 2012) as a medium-bitrate codec, and MP3 and Ogg Vorbis (Montgomery, 2004) as high-bitrate codecs
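The abstract names a Gumbel-Softmax quantizer and the highlights above vary G groups of V variables; the hedged sketch below combines both: each frame produces G × V logits, one codeword per group is chosen with a hard Gumbel-Softmax, and the G selections are concatenated. Dimensions, the temperature, and the module name are illustrative assumptions rather than the authors' settings.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class GumbelGroupQuantizer(nn.Module):
    """Illustrative grouped Gumbel-Softmax quantizer: G groups of V codewords each."""

    def __init__(self, dim: int = 512, groups: int = 2, vars_per_group: int = 320, tau: float = 2.0):
        super().__init__()
        self.groups, self.vars = groups, vars_per_group
        self.to_logits = nn.Linear(dim, groups * vars_per_group)
        # one codebook per group; the selected entries are concatenated back to `dim`
        self.codebooks = nn.Parameter(torch.randn(groups, vars_per_group, dim // groups))
        self.tau = tau  # Gumbel-Softmax temperature (annealed during training in practice)

    def forward(self, z):
        # z: (batch, time, dim) dense encoder outputs
        b, t, _ = z.shape
        logits = self.to_logits(z).view(b, t, self.groups, self.vars)
        # hard one-hot selection in the forward pass, soft gradients in the backward pass
        one_hot = F.gumbel_softmax(logits, tau=self.tau, hard=True, dim=-1)
        q = torch.einsum("btgv,gvd->btgd", one_hot, self.codebooks)
        idx = one_hot.argmax(dim=-1)            # (batch, time, groups) code indices
        return q.reshape(b, t, -1), idx         # concatenated codewords, discrete codes
```

For example, G = 2 groups of V = 320 variables give V^G = 102,400 theoretically possible codewords, which is the quantity varied in the codebook-size experiments above.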
Results
  • BERT (Devlin et al, 2018) is a pre-training approach for NLP tasks, which uses a transformer encoder model to build a representation of text.
  • vq-wav2vec learns vector quantized (VQ) representations of audio data using a future time-step prediction task.
  • One possibility is to use the discretized training data and apply BERT pre-training where the task is to predict masked input tokens based on an encoding of the surrounding context (Devlin et al, 2018).
  • The authors train a wav2letter acoustic model on WSJ by inputting either the BERT or vq-wav2vec representations instead of log-mel filterbanks.
  • For vq-wav2vec, the authors first experiment with the Gumbel-Softmax variant, with and without a BERT base model (§5.3).
  • Once the audio is discretized, the authors can train a standard sequence-to-sequence model to perform speech recognition.
  • The authors trained an off-the-shelf Big Transformer (Vaswani et al, 2017; Ott et al, 2019) on the vq-wav2vec Gumbel-Softmax discretized Librispeech corpus and evaluated on the Librispeech dev/test sets; the authors use a 4k BPE output vocabulary (Sennrich et al, 2016).
  • The authors train models with different numbers of groups G and variables V to vary the possible codebook size V^G and measure accuracy on TIMIT phoneme recognition without BERT training.
  • The authors experiment with vq-wav2vec k-means and train models with 1, 2, 4, 8, 16, and 32 groups, using 40, 80, 160, ..., 1280 variables, spanning a bitrate range from 0.53 kbit/s (G = 1, V = 40) to 33.03 kbit/s (G = 32, V = 1280); the sketch below shows how these bitrates follow from G and V.
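The quoted bitrates follow directly from G and V: each discrete frame carries G · log2(V) bits, and a wav2vec-style encoder emits roughly one code every 10 ms, i.e. 100 codes per second (the 10 ms stride is an assumption carried over from the wav2vec setup, not stated in this summary). The short check below reproduces both endpoints of the range.

```python
from math import log2


def bitrate_kbit_per_s(groups: int, vars_per_group: int, codes_per_second: float = 100.0) -> float:
    """kbit/s for G groups of V codewords, assuming one code every 10 ms (100 codes/s)."""
    bits_per_code = groups * log2(vars_per_group)
    return bits_per_code * codes_per_second / 1000.0


print(round(bitrate_kbit_per_s(1, 40), 2))     # 0.53 kbit/s  (G = 1,  V = 40)
print(round(bitrate_kbit_per_s(32, 1280), 2))  # 33.03 kbit/s (G = 32, V = 1280)
```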
Conclusion
  • The authors place the quantization module after the aggregator module and train all models in the small vq-wav2vec setup (§5.2) on the 100h clean Librispeech subset.
  • The authors consider various lossy compression algorithms applied to the TIMIT audio data and train wav2letter models on the resulting audio: Codec2 as a low-bitrate codec, Opus (Terriberry & Vos, 2012) as a medium-bitrate codec, and MP3 and Ogg Vorbis (Montgomery, 2004) as high-bitrate codecs.
  • BERT training on discretized audio data is fairly robust to masking large parts of the input (Table 5b).
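For the masking robustness result in the last bullet (Table 5b), here is a minimal, illustrative sketch of span masking over the discrete codes, assuming each position starts a masked span of M consecutive tokens with probability p (M = 10 and p = 0.15 are the values quoted in the table captions; the function and the mask token id are assumptions, not the authors' code):

```python
import random


def mask_spans(tokens, mask_len: int = 10, start_prob: float = 0.15, mask_id: int = 0):
    """Mask spans of `mask_len` consecutive codes; each position may start a span with
    probability `start_prob`. Returns the corrupted sequence and the masked positions."""
    out, masked = list(tokens), set()
    for i in range(len(tokens)):
        if random.random() < start_prob:
            masked.update(range(i, min(i + mask_len, len(tokens))))
    for i in masked:
        out[i] = mask_id
    return out, sorted(masked)


codes = list(range(100, 150))              # dummy discrete vq-wav2vec codes
corrupted, positions = mask_spans(codes)   # BERT is then trained to predict codes at `positions`
```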
Tables
  • Table 1: WSJ accuracy of vq-wav2vec on the development (nov93dev) and test set (nov92) in terms of letter error rate (LER) and word error rate (WER) without language modeling (No LM), with a 4-gram LM, and with a character convolutional LM. vq-wav2vec with BERT pre-training improves over the best wav2vec model (Schneider et al, 2019)
  • Table 2: Comparison of Gumbel-Softmax and k-means vector quantization on WSJ (cf. Table 1)
  • Table 3: TIMIT phoneme recognition in terms of phoneme error rate (PER). All our models use the CNN-8L-PReLU-do0.7 architecture (Zeghidour et al, 2018)
  • Table 4: Librispeech results for a standard sequence-to-sequence model trained on discretized audio without BERT pre-training, and results from the literature. All results are without a language model
  • Table 5: TIMIT PER for (a) different mask sizes M with p = 0.15 in BERT training and (b) mask probabilities p for a fixed mask length M = 10
  • Table 6: PER on the TIMIT dev set for vq-wav2vec models trained on Libri100. Results are based on three random seeds
  • Table 7: Fraction of used codewords vs. the number of theoretically possible codewords V^G (in brackets); 39.9M is the number of tokens in Librispeech 100h
Reference
  • Dario Amodei, Sundaram Ananthanarayanan, Rishita Anubhai, Jingliang Bai, Eric Battenberg, Carl Case, Jared Casper, Bryan Catanzaro, Qiang Cheng, Guoliang Chen, et al. Deep speech 2: End-to-end speech recognition in English and Mandarin. In Proc. of ICML, 2016.
  • Mathilde Caron, Piotr Bojanowski, Julien Mairal, and Armand Joulin. Unsupervised pre-training of image features on non-curated data. In Proceedings of the International Conference on Computer Vision (ICCV), 2019.
  • Jan Chorowski, Ron J. Weiss, Samy Bengio, and Aaron van den Oord. Unsupervised speech representation learning using wavenet autoencoders. arXiv, abs/1901.08810, 2019.
  • Yu-An Chung and James Glass. Speech2vec: A sequence-to-sequence framework for learning word embeddings from speech. arXiv, abs/1803.08976, 2018.
  • Yu-An Chung, Wei-Ning Hsu, Hao Tang, and James Glass. An unsupervised autoregressive model for speech representation learning. arXiv, abs/1904.03240, 2019.
  • Ronan Collobert, Christian Puhrsch, and Gabriel Synnaeve. Wav2letter: An end-to-end convnet-based speech recognition system. arXiv, abs/1609.03193, 2016.
  • Ronan Collobert, Awni Hannun, and Gabriel Synnaeve. A fully differentiable beam search decoder. arXiv, abs/1902.06022, 2019.
  • FFmpeg Developers. ffmpeg tool software, 2016. URL http://ffmpeg.org/.
  • Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv, abs/1810.04805, 2018.
  • Ewan Dunbar, Robin Algayres, Julien Karadayi, Mathieu Bernard, Juan Benjumea, Xuan-Nga Cao, Lucie Miskic, Charlotte Dugrain, Lucas Ondel, Alan W Black, et al. The zero resource speech challenge 2019: TTS without T. arXiv, abs/1904.11469, 2019.
  • Ryan Eloff, Andre Nortje, Benjamin van Niekerk, Avashna Govender, Leanne Nortje, Arnu Pretorius, Elan Van Biljon, Ewald van der Westhuizen, Lisa van Staden, and Herman Kamper. Unsupervised acoustic unit discovery for speech synthesis using discrete latent-variable neural networks. arXiv, abs/1904.07556, 2019.
  • John S. Garofolo, David Graff, Doug Paul, and David S. Pallett. CSR-I (WSJ0) Complete LDC93S6A. Web Download. Linguistic Data Consortium, 1993a.
  • John S. Garofolo, Lori F. Lamel, William M. Fisher, Jonathon G. Fiscus, David S. Pallett, and Nancy L. Dahlgren. The DARPA TIMIT Acoustic-Phonetic Continuous Speech Corpus CDROM. Linguistic Data Consortium, 1993b.
  • Pegah Ghahremani, Vimal Manohar, Hossein Hadian, Daniel Povey, and Sanjeev Khudanpur. Investigation of transfer learning for ASR using LF-MMI trained neural networks. In Proc. of ASRU, 2017.
  • Emil Julius Gumbel. Statistical theory of extreme values and some practical applications: a series of lectures, volume 33. US Government Printing Office, 1954.
  • Hossein Hadian, Hossein Sameti, Daniel Povey, and Sanjeev Khudanpur. End-to-end speech recognition using lattice-free MMI. In Proc. of Interspeech, 2018.
  • Kenneth Heafield, Ivan Pouzyrevsky, Jonathan H. Clark, and Philipp Koehn. Scalable modified Kneser-Ney language model estimation. In Proc. of ACL, 2013.
  • Kazuki Irie, Rohit Prabhavalkar, Anjuli Kannan, Antoine Bruguier, David Rybach, and Patrick Nguyen. On the choice of modeling unit for sequence-to-sequence speech recognition. In Proc. of Interspeech, 2019. doi: 10.21437/interspeech.2019-2277. URL http://dx.doi.org/10.21437/Interspeech.2019-2277.
  • Eric Jang, Shixiang Gu, and Ben Poole. Categorical reparameterization with Gumbel-Softmax. arXiv, abs/1611.01144, 2016.
  • Herve Jegou, Matthijs Douze, and Cordelia Schmid. Product quantization for nearest neighbor search. IEEE Trans. Pattern Anal. Mach. Intell., 33(1):117–128, January 2011.
  • Mandar Joshi, Danqi Chen, Yinhan Liu, Daniel S. Weld, Luke Zettlemoyer, and Omer Levy. SpanBERT: Improving pre-training by representing and predicting spans. arXiv, abs/1907.10529, 2019.
  • Tatiana Likhomanenko, Gabriel Synnaeve, and Ronan Collobert. Who needs words? Lexicon-free speech recognition. In Proc. of Interspeech, 2019.
  • Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, and Veselin Stoyanov. RoBERTa: A robustly optimized BERT pretraining approach. arXiv, abs/1907.11692, 2019.
  • Ilya Loshchilov and Frank Hutter. SGDR: Stochastic gradient descent with restarts. arXiv, abs/1608.03983, 2016.
  • Chris J Maddison, Daniel Tarlow, and Tom Minka. A* sampling. In Advances in Neural Information Processing Systems, pp. 3086–3094, 2014.
  • Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S Corrado, and Jeff Dean. Distributed representations of words and phrases and their compositionality. In Proc. of NIPS, 2013.
  • Abdelrahman Mohamed, Dmytro Okhonko, and Luke Zettlemoyer. Transformers with convolutional context for ASR. arXiv, abs/1904.11660, 2019.
  • C. Montgomery. Vorbis I specification, 2004.
  • Myle Ott, Sergey Edunov, David Grangier, and Michael Auli. Scaling neural machine translation. In Proc. of WMT, 2018.
  • Myle Ott, Sergey Edunov, Alexei Baevski, Angela Fan, Sam Gross, Nathan Ng, David Grangier, and Michael Auli. fairseq: A fast, extensible toolkit for sequence modeling. In Proc. of NAACL System Demonstrations, 2019.
  • Vassil Panayotov, Guoguo Chen, Daniel Povey, and Sanjeev Khudanpur. Librispeech: An ASR corpus based on public domain audio books. In Proc. of ICASSP, pp. 5206–5210. IEEE, 2015.
  • Daniel S. Park, William Chan, Yu Zhang, Chung-Cheng Chiu, Barret Zoph, Ekin D. Cubuk, and Quoc V. Le. SpecAugment: A simple data augmentation method for automatic speech recognition, 2019.
  • Mirco Ravanelli, Philemon Brakel, Maurizio Omologo, and Yoshua Bengio. Light gated recurrent units for speech recognition. IEEE Transactions on Emerging Topics in Computational Intelligence, 2(2):92–102, 2018.
  • Steffen Schneider, Alexei Baevski, Ronan Collobert, and Michael Auli. wav2vec: Unsupervised pre-training for speech recognition. arXiv, abs/1904.05862, 2019. URL http://arxiv.org/abs/1904.05862.
  • Rico Sennrich, Barry Haddow, and Alexandra Birch. Neural machine translation of rare words with subword units. In Proc. of ACL, 2016.
  • Tim Terriberry and Koen Vos. Definition of the Opus audio codec, 2012.
  • Andros Tjandra, Berrak Sisman, Mingyang Zhang, Sakriani Sakti, Haizhou Li, and Satoshi Nakamura. VQVAE unsupervised unit discovery and multi-scale code2spec inverter for ZeroSpeech challenge 2019. arXiv, abs/1905.11449, 2019.
  • Aaron van den Oord, Oriol Vinyals, et al. Neural discrete representation learning. In Advances in Neural Information Processing Systems, pp. 6306–6315, 2017.
  • Aaron van den Oord, Yazhe Li, and Oriol Vinyals. Representation learning with contrastive predictive coding. arXiv, abs/1807.03748, 2018.
  • Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, and Illia Polosukhin. Attention is all you need. In Proc. of NIPS, 2017.
  • Maarten Versteegh, Xavier Anguera, Aren Jansen, and Emmanuel Dupoux. The zero resource speech challenge 2015: Proposed approaches and results. Procedia Computer Science, 81:67–72, 2016.
  • Yuxin Wu and Kaiming He. Group normalization. arXiv, abs/1803.08494, 2018.
  • Neil Zeghidour, Nicolas Usunier, Iasonas Kokkinos, Thomas Schaiz, Gabriel Synnaeve, and Emmanuel Dupoux. Learning filterbanks from raw speech for phone recognition. In Proc. of ICASSP, 2018.