wav2vec 2.0: A Framework for Self-Supervised Learning of Speech Representations

NeurIPS 2020.

Keywords:
unsupervised pre-training, Connectionist Temporal Classification, speech representation, language model, latent representation
Weibo:
We presented wav2vec 2.0, a framework for self-supervised learning of speech representations which masks latent representations of the raw waveform and solves a contrastive task over quantized speech representations

Abstract:

We show for the first time that learning powerful representations from speech audio alone followed by fine-tuning on transcribed speech can outperform the best semi-supervised methods while being conceptually simpler. wav2vec 2.0 masks the speech input in the latent space and solves a contrastive task defined over a quantization of the ...

Introduction
  • Neural networks benefit from large quantities of labeled training data. However, in many settings labeled data is much harder to come by than unlabeled data: current speech recognition systems require thousands of hours of transcribed speech to reach acceptable performance, which is not available for the vast majority of the nearly 7,000 languages spoken worldwide [30].
  • Self-supervised learning has emerged as a paradigm to learn general data representations from unlabeled examples and to fine-tune the model on labeled data
  • This has been successful for natural language processing [42, 44, 9] and is an active research area for computer vision [19, 2, 35, 18, 6].
  • In this approach, the raw waveform is first encoded by a multi-layer convolutional feature encoder and spans of the resulting latent speech representations are masked; the latent representations are fed to a Transformer network to build contextualized representations, and the model is trained via a contrastive task where the true latent is to be distinguished from distractors [51, 47, 46, 27] (§ 2), as sketched below.
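To make the pre-training objective concrete, the following is a minimal PyTorch sketch of a contrastive (InfoNCE-style) loss over masked time steps in the spirit of § 2: the contextualized output at each masked position must identify the true quantized latent among K distractors, using a cosine-similarity score scaled by a temperature. The function name, tensor shapes, and uniform distractor sampling are illustrative assumptions rather than the authors' fairseq implementation (the paper uses K = 100 distractors and a temperature of 0.1).

    import torch
    import torch.nn.functional as F

    def contrastive_loss(context, targets, num_distractors=100, temperature=0.1):
        """InfoNCE-style loss over masked time steps (simplified sketch).

        context: (num_masked, dim) Transformer outputs at masked positions.
        targets: (num_masked, dim) quantized latents at the same positions.
        Distractors are drawn uniformly from the other target vectors; the
        paper samples them from other masked time steps of the same utterance.
        """
        n = targets.size(0)
        idx = torch.randint(0, n, (n, num_distractors))                       # distractor indices
        candidates = torch.cat([targets.unsqueeze(1), targets[idx]], dim=1)   # (n, K+1, dim)
        sims = F.cosine_similarity(context.unsqueeze(1).expand_as(candidates),
                                   candidates, dim=-1)                        # (n, K+1)
        labels = torch.zeros(n, dtype=torch.long)                             # true latent sits at index 0
        return F.cross_entropy(sims / temperature, labels)

    # toy usage: random tensors stand in for real masked representations
    loss = contrastive_loss(torch.randn(32, 256), torch.randn(32, 256))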
Highlights
  • Neural networks benefit from large quantities of labeled training data
  • The latent representations are fed to a Transformer network to build contextualized representations and the model is trained via a contrastive task where the true latent is to be distinguished from distractors [51, 47, 46, 27] (§ 2)
  • Our results demonstrate the feasibility of ultra-low resource speech recognition: when using only 10 minutes of labeled data, our approach achieves word error rate (WER) 5.7/10.1 on the clean/noisy test sets of Librispeech
  • The models are pre-trained on the audio data of either Librispeech (LS-960) or LibriVox (LV-60k) and most results are obtained by decoding with a Transformer language model (Transf.); Appendix C shows results with other language models
  • We presented wav2vec 2.0, a framework for self-supervised learning of speech representations which masks latent representations of the raw waveform and solves a contrastive task over quantized speech representations
  • Our experiments show the large potential of pre-training on unlabeled data for speech processing: when using only 10 minutes of labeled training data, or 48 recordings of 12.5 seconds on average, we achieve a WER of 5.7/10.1 on test-clean/other of Librispeech
Methods
  • As unlabeled data, the authors consider the audio of the Librispeech corpus [39] without transcriptions, containing 960 hours (LS-960), or the audio data from LibriVox (LV-60k).
  • For the latter, the authors follow the preprocessing of [26], resulting in 53.2k hours of audio.
  • The authors fine-tune the pre-trained models for phoneme recognition on the TIMIT dataset [13]
  • It contains five hours of audio recordings with detailed phoneme labels.
  • The authors use the standard train, dev and test splits and follow the standard protocol of collapsing phone labels to 39 classes (a fine-tuning sketch follows this list).
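For fine-tuning, the paper adds a randomly initialized linear projection on top of the context network and trains with a CTC loss [14]. The sketch below shows that general recipe with torch.nn.CTCLoss; the encoder output dimension, the 39-phone target size, and the random tensors are placeholders rather than the authors' exact configuration.

    import torch
    import torch.nn as nn

    class PhonemeHead(nn.Module):
        """Linear classifier over pre-trained speech representations for CTC training."""
        def __init__(self, dim=768, num_phones=39):
            super().__init__()
            self.proj = nn.Linear(dim, num_phones + 1)   # +1 output for the CTC blank symbol

        def forward(self, features):                     # features: (batch, frames, dim)
            return self.proj(features).log_softmax(dim=-1)

    head = PhonemeHead()
    ctc = nn.CTCLoss(blank=0, zero_infinity=True)

    feats = torch.randn(2, 100, 768)                     # stand-in for encoder outputs
    log_probs = head(feats).transpose(0, 1)              # CTCLoss expects (frames, batch, classes)
    targets = torch.randint(1, 40, (2, 20))              # phone labels in 1..39 (0 is blank)
    loss = ctc(log_probs, targets,
               input_lengths=torch.full((2,), 100, dtype=torch.long),
               target_lengths=torch.full((2,), 20, dtype=torch.long))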
Results
  • The authors first evaluate the pre-trained models in settings where the amount of labeled data is limited to get a sense of how the representations learned on unlabeled data can improve low resource settings.
  • The LARGE model pre-trained on LV-60k and fine-tuned on only 10 minutes of labeled data achieves a word error rate of 5.7/10.1 on the Librispeech clean/other test sets.
  • Ten minutes of labeled data corresponds to just 48 recordings with an average length of 12.5 seconds (a quick check follows this list).
  • This demonstrates that ultra-low resource speech recognition is possible with self-supervised learning on unlabeled data.
  • The authors' approach improves over previous pre-training work which did not learn quantized audio units jointly [4], reducing WER by about a third
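As a quick check of the ten-minute figure cited above:

    $48 \times 12.5\,\text{s} = 600\,\text{s} = 10\,\text{min}$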
Conclusion
  • The authors presented wav2vec 2.0, a framework for self-supervised learning of speech representations which masks latent representations of the raw waveform and solves a contrastive task over quantized speech representations (a quantization sketch follows this list).
  • The authors' experiments show the large potential of pre-training on unlabeled data for speech processing: when using only 10 minutes of labeled training data, or 48 recordings of 12.5 seconds on average, the authors achieve a WER of 5.7/10.1 on test-clean/other of Librispeech.
  • The authors' model achieves a new state of the art on the clean 100 hour Librispeech setup and outperforms the previous best result even when using 100 times less labeled data.
  • The approach is also effective when large amounts of labeled data are available.
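The quantized targets come from a codebook: the feature-encoder output is mapped to logits over codebook entries and a Gumbel softmax [23] selects an entry while keeping the choice differentiable during training. Below is a minimal single-codebook sketch built on torch.nn.functional.gumbel_softmax; the paper actually uses product quantization with several codebook groups whose outputs are concatenated, and the codebook size, dimensions, and fixed temperature here are illustrative assumptions.

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class GumbelQuantizer(nn.Module):
        """Single-group Gumbel-softmax vector quantizer (simplified sketch)."""
        def __init__(self, dim=512, num_entries=320, out_dim=256):
            super().__init__()
            self.to_logits = nn.Linear(dim, num_entries)           # scores over codebook entries
            self.codebook = nn.Parameter(torch.randn(num_entries, out_dim))

        def forward(self, z, tau=2.0):                             # z: (batch, frames, dim)
            logits = self.to_logits(z)
            # hard=True emits one-hot codes in the forward pass while gradients
            # flow through the soft probabilities (straight-through estimator)
            codes = F.gumbel_softmax(logits, tau=tau, hard=True)
            return codes @ self.codebook                           # (batch, frames, out_dim)

    q = GumbelQuantizer()(torch.randn(4, 50, 512))                 # toy usage on random features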
Tables
  • Table1: WER on the Librispeech dev/test sets when training on the Libri-light low-resource labeled data setups of 10 min, 1 hour, 10 hours and the clean 100h subset of Librispeech. Models use either the audio of Librispeech (LS-960) or the larger LibriVox (LV-60k) as unlabeled data. We consider two model sizes: BASE (95m parameters) and LARGE (317m parameters). Prior work used 860 unlabeled hours (LS-860) but the total with labeled data is 960 hours and comparable to our setup
  • Table2: WER on Librispeech when using all labeled data of 960 hours (cf
  • Table3: TIMIT phoneme recognition accuracy in terms of phoneme error rate (PER)
  • Table4: Average WER and standard deviation on combined dev-clean/other of Librispeech for three training seeds. We ablate quantizing the context network input and the targets in the contrastive loss
  • Table5: Ablations on settings for the masking strategy during pre-training. When masking without overlap, we choose starting time steps with p = 0.037, which makes the total number of masked tokens match the baseline (see the masking sketch after this list)
  • Table6: Fine-tuning hyperparameters (timestep mask probability and channel mask probability for each labeled-data setup)
  • Table7: Decoding parameters for Librispeech subsets
  • Table8: WER on the Librispeech dev/test sets when training on the Libri-light low-resource labeled data setups (cf
  • Table9: WER on Librispeech when using all 960 hours of Librispeech as labeled data (cf
  • Table10: Top word errors for models trained on 10m, 1h, 10h, 100h, and 960h of labeled data and decoded on the Librispeech dev-clean subset without a language model or lexicon (see Table 8 and Table 9 - None). In brackets is the total number of occurrences of each error
  • Table11: Examples of transcription of selected utterances from the dev-clean subset by various models without a language model or lexicon. Capitalized words indicate errors
  • Table12: Ablation of various hyper-parameter choices. We report average WER and standard deviation on combined dev-clean/other of Librispeech for three seeds of training
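For context on the masking ablation in Table 5: during pre-training, time steps are sampled as span starts with probability p and every start masks the next M consecutive frames, so spans can overlap and merge. The snippet below is a simplified stand-alone version of that sampling (not the fairseq implementation); the defaults p = 0.065 and M = 10 correspond to the baseline setting, under which roughly half of the frames end up masked.

    import numpy as np

    def compute_span_mask(num_frames, p=0.065, mask_length=10, rng=None):
        """Sample a boolean mask: each frame starts a masked span with probability p,
        and each span covers `mask_length` consecutive frames (overlaps merge)."""
        rng = np.random.default_rng() if rng is None else rng
        mask = np.zeros(num_frames, dtype=bool)
        for start in np.nonzero(rng.random(num_frames) < p)[0]:
            mask[start:start + mask_length] = True
        return mask

    print(compute_span_mask(500).mean())   # roughly 0.49 of frames masked with the defaults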
Reference
  • [1] J. L. Ba, J. R. Kiros, and G. E. Hinton. Layer normalization. arXiv, 2016.
  • [2] P. Bachman, R. D. Hjelm, and W. Buchwalter. Learning representations by maximizing mutual information across views. In Proc. of NeurIPS, 2019.
  • [3] A. Baevski and M. Auli. Adaptive input representations for neural language modeling. In Proc. of ICLR, 2019.
  • [4] A. Baevski, M. Auli, and A. Mohamed. Effectiveness of self-supervised pre-training for speech recognition. arXiv, abs/1911.03912, 2019.
  • [5] A. Baevski, S. Schneider, and M. Auli. vq-wav2vec: Self-supervised learning of discrete speech representations. In Proc. of ICLR, 2020.
  • [6] T. Chen, S. Kornblith, M. Norouzi, and G. Hinton. A simple framework for contrastive learning of visual representations. arXiv, abs/2002.05709, 2020.
  • [7] J. Chorowski, R. J. Weiss, S. Bengio, and A. van den Oord. Unsupervised speech representation learning using wavenet autoencoders. arXiv, abs/1901.08810, 2019.
  • [8] Y. Chung, W. Hsu, H. Tang, and J. R. Glass. An unsupervised autoregressive model for speech representation learning. arXiv, abs/1904.03240, 2019.
  • [9] J. Devlin, M.-W. Chang, K. Lee, and K. Toutanova. BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv, abs/1810.04805, 2018.
  • [10] S. Dieleman, A. van den Oord, and K. Simonyan. The challenge of realistic music generation: modelling raw audio at scale. arXiv, 2018.
  • [11] R. Eloff, A. Nortje, B. van Niekerk, A. Govender, L. Nortje, A. Pretorius, E. Van Biljon, E. van der Westhuizen, L. van Staden, and H. Kamper. Unsupervised acoustic unit discovery for speech synthesis using discrete latent-variable neural networks. arXiv, abs/1904.07556, 2019.
  • [12] A. Fan, E. Grave, and A. Joulin. Reducing transformer depth on demand with structured dropout. In Proc. of ICLR, 2020.
  • [13] J. S. Garofolo, L. F. Lamel, W. M. Fisher, J. G. Fiscus, D. S. Pallett, and N. L. Dahlgren. The DARPA TIMIT Acoustic-Phonetic Continuous Speech Corpus CDROM. Linguistic Data Consortium, 1993.
  • [14] A. Graves, S. Fernández, and F. Gomez. Connectionist temporal classification: Labelling unsegmented sequence data with recurrent neural networks. In Proc. of ICML, 2006.
  • [15] E. J. Gumbel. Statistical theory of extreme values and some practical applications: a series of lectures, volume 33. US Government Printing Office, 1954.
  • [16] W. Han, Z. Zhang, Y. Zhang, J. Yu, C.-C. Chiu, J. Qin, A. Gulati, R. Pang, and Y. Wu. ContextNet: Improving convolutional neural networks for automatic speech recognition with global context. arXiv, 2020.
  • [17] D. Harwath, W.-N. Hsu, and J. Glass. Learning hierarchical discrete linguistic units from visually-grounded speech. In Proc. of ICLR, 2020.
  • [18] K. He, H. Fan, Y. Wu, S. Xie, and R. Girshick. Momentum contrast for unsupervised visual representation learning. arXiv, abs/1911.05722, 2019.
  • [19] O. J. Hénaff, A. Razavi, C. Doersch, S. M. A. Eslami, and A. van den Oord. Data-efficient image recognition with contrastive predictive coding. arXiv, abs/1905.09272, 2019.
  • [20] D. Hendrycks and K. Gimpel. Gaussian error linear units (GELUs). arXiv, 2016.
  • [21] G. Huang, Y. Sun, Z. Liu, D. Sedra, and K. Weinberger. Deep networks with stochastic depth. arXiv, 2016.
  • [22] M. Gutmann and A. Hyvärinen. Noise-contrastive estimation: A new estimation principle for unnormalized statistical models. In Proc. of AISTATS, 2010.
  • [23] E. Jang, S. Gu, and B. Poole. Categorical reparameterization with Gumbel-softmax. arXiv, abs/1611.01144, 2016.
  • [24] H. Jegou, M. Douze, and C. Schmid. Product quantization for nearest neighbor search. IEEE Trans. Pattern Anal. Mach. Intell., 33(1):117–128, Jan. 2011.
  • [25] D. Jiang, X. Lei, W. Li, N. Luo, Y. Hu, W. Zou, and X. Li. Improving transformer-based speech recognition using unsupervised pre-training. arXiv, abs/1910.09932, 2019.
  • [26] J. Kahn et al. Libri-light: A benchmark for ASR with limited or no supervision. In Proc. of ICASSP, 2020.
  • [27] K. Kawakami, L. Wang, C. Dyer, P. Blunsom, and A. van den Oord. Learning robust and multilingual speech representations. arXiv, 2020.
  • [28] D. P. Kingma and J. Ba. Adam: A method for stochastic optimization. In Proc. of ICLR, 2015.
  • [29] A. Laptev, R. Korostik, A. Svischev, A. Andrusenko, I. Medennikov, and S. Rybin. You do not need more data: Improving end-to-end speech recognition by text-to-speech data augmentation. arXiv, abs/2005.07157, 2020.
  • [30] M. P. Lewis, G. F. Simons, and C. D. Fennig. Ethnologue: Languages of the world, nineteenth edition. Online version: http://www.ethnologue.com, 2016.
  • [31] A. H. Liu, T. Tu, H.-y. Lee, and L.-s. Lee. Towards unsupervised speech recognition and synthesis with quantized speech representation learning. arXiv, 2019.
  • [32] Y. Liu, M. Ott, N. Goyal, J. Du, M. Joshi, D. Chen, O. Levy, M. Lewis, L. Zettlemoyer, and V. Stoyanov. RoBERTa: A robustly optimized BERT pretraining approach. arXiv, abs/1907.11692, 2019.
  • [33] C. Lüscher, E. Beck, K. Irie, M. Kitza, W. Michel, A. Zeyer, R. Schlüter, and H. Ney. RWTH ASR systems for LibriSpeech: Hybrid vs attention. In Proc. of Interspeech, 2019.
  • [34] C. J. Maddison, D. Tarlow, and T. Minka. A* sampling. In Proc. of NIPS, pages 3086–3094, 2014.
  • [35] I. Misra and L. van der Maaten. Self-supervised learning of pretext-invariant representations. arXiv, 2019.
  • [36] A. Mohamed, D. Okhonko, and L. Zettlemoyer. Transformers with convolutional context for ASR. arXiv, abs/1904.11660, 2019.
  • [37] M. Ott, S. Edunov, D. Grangier, and M. Auli. Scaling neural machine translation. In Proc. of WMT, 2018.
  • [38] M. Ott, S. Edunov, A. Baevski, A. Fan, S. Gross, N. Ng, D. Grangier, and M. Auli. fairseq: A fast, extensible toolkit for sequence modeling. In Proc. of NAACL System Demonstrations, 2019.
  • [39] V. Panayotov, G. Chen, D. Povey, and S. Khudanpur. Librispeech: An ASR corpus based on public domain audio books. In Proc. of ICASSP, pages 5206–5210, 2015.
  • [40] D. S. Park, W. Chan, Y. Zhang, C.-C. Chiu, B. Zoph, E. D. Cubuk, and Q. V. Le. SpecAugment: A simple data augmentation method for automatic speech recognition. In Proc. of Interspeech, 2019.
  • [41] D. S. Park, Y. Zhang, Y. Jia, W. Han, C.-C. Chiu, B. Li, Y. Wu, and Q. V. Le. Improved noisy student training for automatic speech recognition. arXiv, abs/2005.09629, 2020.
  • [42] M. E. Peters, M. Neumann, M. Iyyer, M. Gardner, C. Clark, K. Lee, and L. Zettlemoyer. Deep contextualized word representations. In Proc. of ACL, 2018.
  • [43] V. Pratap, A. Hannun, Q. Xu, J. Cai, J. Kahn, G. Synnaeve, V. Liptchinsky, and R. Collobert. wav2letter++: A fast open-source speech recognition system. In Proc. of ICASSP, 2019.
  • [44] A. Radford, K. Narasimhan, T. Salimans, and I. Sutskever. Improving language understanding by generative pre-training. https://s3-us-west-2.amazonaws.com/openai-assets/research-covers/language-unsupervised/language_understanding_paper.pdf, 2018.
  • [45] M. Ravanelli, P. Brakel, M. Omologo, and Y. Bengio. Light gated recurrent units for speech recognition. IEEE Transactions on Emerging Topics in Computational Intelligence, 2(2):92–102, 2018.
  • [46] M. Rivière, A. Joulin, P.-E. Mazaré, and E. Dupoux. Unsupervised pretraining transfers well across languages. arXiv, abs/2002.02848, 2020.
  • [47] S. Schneider, A. Baevski, R. Collobert, and M. Auli. wav2vec: Unsupervised pre-training for speech recognition. In Proc. of Interspeech, 2019.
  • [48] M. Schuster and K. Nakajima. Japanese and Korean voice search. In Proc. of ICASSP, 2012.
  • [49] G. Synnaeve, Q. Xu, J. Kahn, T. Likhomanenko, E. Grave, V. Pratap, A. Sriram, V. Liptchinsky, and R. Collobert. End-to-end ASR: From supervised to semi-supervised learning with modern architectures. arXiv, abs/1911.08460, 2020.
  • [50] A. Tjandra, B. Sisman, M. Zhang, S. Sakti, H. Li, and S. Nakamura. VQVAE unsupervised unit discovery and multi-scale code2spec inverter for ZeroSpeech Challenge 2019. arXiv, abs/1905.11449, 2019.
  • [51] A. van den Oord, O. Vinyals, et al. Neural discrete representation learning. In Proc. of NIPS, pages 6306–6315, 2017.
  • [52] A. van den Oord, Y. Li, and O. Vinyals. Representation learning with contrastive predictive coding. arXiv, abs/1807.03748, 2018.
  • [53] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, L. Kaiser, and I. Polosukhin. Attention is all you need. In Proc. of NIPS, 2017.
  • [54] W. Wang, Q. Tang, and K. Livescu. Unsupervised pre-training of bidirectional speech encoders via masked reconstruction. arXiv, 2020.
  • [55] F. Wu, A. Fan, A. Baevski, Y. N. Dauphin, and M. Auli. Pay less attention with lightweight and dynamic convolutions. In Proc. of ICLR, 2019.
  • [56] Q. Xu, T. Likhomanenko, J. Kahn, A. Hannun, G. Synnaeve, and R. Collobert. Iterative pseudo-labeling for speech recognition. arXiv, 2020.
  • [57] N. Zeghidour, N. Usunier, I. Kokkinos, T. Schatz, G. Synnaeve, and E. Dupoux. Learning filterbanks from raw speech for phone recognition. In Proc. of ICASSP, 2018.
  • [58] Q. Zhang, H. Lu, H. Sak, A. Tripathi, E. McDermott, S. Koo, and S. Kumar. Transformer transducer: A streamable speech recognition model with transformer encoders and RNN-T loss. arXiv, 2020.