Multi-Task Self-Supervised Learning for Robust Speech Recognition

Zhong Jianyuan
Pascual Santiago
Swietojanski Pawel
Monteiro Joao
Trmal Jan

ICASSP, pp. 6989-6993, 2020.

Keywords: mean squared error, log power spectrum, phone error rate, local info max, speech representation
In brief:
The proposed problem-agnostic speech encoder (PASE+) architecture is based on an online speech distortion module, a convolutional encoder coupled with a quasi-recurrent neural network layer, and a set of workers solving self-supervised problems

Abstract:

Despite the growing interest in unsupervised learning, extracting meaningful knowledge from unlabelled audio remains an open challenge. To take a step in this direction, we recently proposed a problem-agnostic speech encoder (PASE) that combines a convolutional encoder followed by multiple neural networks, called workers, tasked to solve self-supervised problems. […]

Introduction
  • Deep learning relies on hierarchical representations that are commonly learned in a supervised way from large corpora.
  • Access to such annotated corpora is often expensive, making it of paramount interest to study techniques able to extract knowledge from unlabelled data.
  • The authors' recent attempt to learn speech representations with a multi-task self-supervised approach led them to the development of a problem-agnostic speech encoder (PASE) [15], which turned out to learn meaningful speech information such as speaker identities, phonemes, and emotions.
  • Although the authors' initial PASE variant provided promising results in several small-scale speech tasks, it was not explicitly designed to learn features robust against noise and reverberation
Highlights
  • Deep learning relies on hierarchical representations that are commonly learned in a supervised way from large corpora
  • The problem-agnostic speech encoder (PASE) relies on a convolutional encoder followed by an ensemble of small neural networks, called workers, that are jointly trained to solve multiple self-supervised tasks
  • We describe how PASE+ is pre-trained in a self-supervised manner, with a particular focus on the main improvements proposed on top of the original [15]
  • The first row shows the results achieved with the original version of PASE [15], which was trained on only 10 hours of LibriSpeech
  • The proposed PASE+ architecture is based on an online speech distortion module, a convolutional encoder coupled with a quasi-recurrent neural network (QRNN) layer, and a set of workers solving self-supervised problems
  • PASE+ turned out to significantly outperform standard acoustic features on different speech recognition tasks, offering further gains when optimized end-to-end with the target acoustic model objective
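The multi-task scheme in the highlights above (a shared encoder whose output feeds an ensemble of workers, each minimizing its own self-supervised loss, e.g. a mean-squared-error regression of the log power spectrum) can be sketched as follows. This is a minimal NumPy illustration, not the paper's implementation: the frame size, feature dimension, random projections, and the two example targets are all assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def encoder(wave, dim=64):
    # Stand-in for the convolutional encoder + QRNN layer: a fixed random
    # linear projection of 10 ms frames (illustrative only, not the paper's net).
    frames = wave.reshape(-1, 160)               # 160 samples = 10 ms at 16 kHz
    proj = rng.standard_normal((160, dim)) / 160
    return frames @ proj                          # (n_frames, dim)

def mse_worker(features, target):
    # A worker is a small regressor trained with mean squared error against a
    # self-supervised target; here an untrained linear head stands in for it.
    head = rng.standard_normal((features.shape[1], target.shape[1])) * 0.01
    return np.mean((features @ head - target) ** 2)

wave = rng.standard_normal(16000)                # 1 s of fake audio
feats = encoder(wave)

# Two illustrative self-supervised targets derived from the waveform itself:
# a per-frame log power spectrum and the features of the following frame.
log_spec = np.log(np.abs(np.fft.rfft(wave.reshape(-1, 160), axis=1)) ** 2 + 1e-8)
next_frame = np.roll(feats, -1, axis=0)

# The encoder is trained on the sum of all worker losses (multi-task objective).
total_loss = mse_worker(feats, log_spec) + mse_worker(feats, next_frame)
```

Because the targets are computed from the signal itself, no manual annotation is needed; gradients from every worker flow back into the shared encoder.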
Results
  • Results on TIMIT, DIRHA and CHiME-5 show that PASE+ significantly outperforms both the previous version of PASE and common acoustic features.
  • Table 2 reports the PER (%) obtained on the clean and noisy versions of TIMIT as the authors progressively improve the original version of PASE.
  • For this experiment, PASE is frozen and used as a simple feature extractor.
  • The first row shows the results achieved with the original version of PASE [15], which was trained on only 10 hours of LibriSpeech.
  • Self-supervised pre-training acts as a powerful regularizer that helps especially when the supervised classifier is trained with a […]
Conclusion
  • This work studied a multi-task self-supervised approach for robust speech recognition.
  • PASE+ turned out to significantly outperform standard acoustic features on different speech recognition tasks, offering further gains when optimized end-to-end with the target acoustic model objective.
  • As supported by the experimental evidence, in future work the authors will explore its applicability to other downstream tasks as well as to sequence-to-sequence neural speech recognition
Tables
  • Table1: List of the distortions used in the speech contamination module (each activated independently with probability p)
  • Table2: Phone error rate (PER) obtained on the TIMIT corpus (clean and noisy) with different versions of PASE
  • Table3: Phone error rate (PER) obtained on the TIMIT and DIRHA corpora (noise+reverb versions) with different input features
  • Table4: CHiME-5 WERs(%) on distant beamformed microphones
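The online contamination scheme of Table 1 (each distortion switched on independently with probability p) can be sketched roughly as below. This is a toy NumPy illustration under assumed settings: the three distortions, their parameters, and the helper names (`add_noise`, `reverberate`, `frequency_mask`) are illustrative, not the paper's exact module.

```python
import numpy as np

rng = np.random.default_rng(0)

def add_noise(wave, snr_db=10.0):
    # Additive noise at a fixed signal-to-noise ratio (illustrative values).
    noise = rng.standard_normal(wave.shape)
    scale = np.sqrt(np.mean(wave**2) / (np.mean(noise**2) * 10 ** (snr_db / 10)))
    return wave + scale * noise

def reverberate(wave, rir_len=200):
    # Convolution with a toy exponentially decaying impulse response,
    # standing in for a measured/simulated room impulse response.
    rir = rng.standard_normal(rir_len) * np.exp(-np.arange(rir_len) / 50)
    out = np.convolve(wave, rir)[: len(wave)]
    return out / (np.max(np.abs(out)) + 1e-8)

def frequency_mask(wave, width=20):
    # Zero a random band of FFT bins, loosely in the spirit of SpecAugment.
    spec = np.fft.rfft(wave)
    start = rng.integers(0, len(spec) - width)
    spec[start : start + width] = 0
    return np.fft.irfft(spec, n=len(wave))

def distort(wave, p=0.5):
    # Each distortion is activated independently with probability p,
    # mirroring the online contamination module of Table 1.
    for fn in (add_noise, reverberate, frequency_mask):
        if rng.random() < p:
            wave = fn(wave)
    return wave

clean = np.sin(2 * np.pi * 440 * np.arange(16000) / 16000)
noisy = distort(clean)
```

Applying the distortions on the fly means every epoch sees a different contaminated copy of each utterance, while the workers' targets are still computed from the clean signal.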
Funding
  • The work reported here was started at JSALT 2019, and supported by JHU with gifts from Amazon, Facebook, Google, and Microsoft
  • This work was also supported by NSERC, Samsung, Compute Canada, NCI/Intersect Australia and the project TEC2015-69266-P (MINECO/FEDER, UE)
Reference
  • [1] Y. Bengio, P. Lamblin, D. Popovici, and H. Larochelle, "Greedy layer-wise training of deep networks," in Proc. of NIPS, 2006.
  • [2] G. E. Hinton, S. Osindero, and Y. Teh, "A fast learning algorithm for deep belief nets," Neural Computation, vol. 18, pp. 1527–1554, 2006.
  • [3] D. P. Kingma and M. Welling, "Auto-encoding variational Bayes," in Proc. of ICLR, 2014.
  • [4] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio, "Generative adversarial nets," in Proc. of NIPS, 2014.
  • [5] C. Doersch and A. Zisserman, "Multi-task self-supervised visual learning," in Proc. of ICCV, 2017.
  • [6] I. Misra, C. L. Zitnick, and M. Hebert, "Shuffle and learn: Unsupervised learning using temporal order verification," in Proc. of ECCV, 2016.
  • [7] R. Zhang, P. Isola, and A. A. Efros, "Colorful image colorization," in Proc. of ECCV, 2016.
  • [8] M. Noroozi and P. Favaro, "Unsupervised learning of visual representations by solving jigsaw puzzles," in Proc. of ECCV, 2016.
  • [9] J. Devlin, M.-W. Chang, K. Lee, and K. Toutanova, "BERT: Pre-training of deep bidirectional transformers for language understanding," CoRR, vol. abs/1810.04805, 2018.
  • [10] Z. Lan, M. Chen, S. Goodman, K. Gimpel, P. Sharma, and R. Soricut, "ALBERT: A lite BERT for self-supervised learning of language representations," in Proc. of ICLR, 2020.
  • [11] A. Jansen, M. Plakal, R. Pandya, D. P. W. Ellis, S. Hershey, J. Liu, R. C. Moore, and R. A. Saurous, "Unsupervised learning of semantic audio representations," in Proc. of ICASSP, 2018.
  • [12] J. Chorowski, R. J. Weiss, S. Bengio, and A. van den Oord, "Unsupervised speech representation learning using WaveNet autoencoders," CoRR, vol. abs/1901.08810, 2019.
  • [13] A. van den Oord, Y. Li, and O. Vinyals, "Representation learning with contrastive predictive coding," arXiv, 2018.
  • [14] M. Ravanelli and Y. Bengio, "Learning speaker representations with mutual information," in Proc. of Interspeech, 2019.
  • [15] S. Pascual, M. Ravanelli, J. Serrà, A. Bonafonte, and Y. Bengio, "Learning problem-agnostic speech representations from multiple self-supervised tasks," in Proc. of Interspeech, 2019.
  • [16] J. Bradbury, S. Merity, C. Xiong, and R. Socher, "Quasi-recurrent neural networks," in Proc. of ICLR, 2017.
  • [17] P. Bell, P. Swietojanski, and S. Renals, "Multi-level adaptive networks in tandem and hybrid ASR systems," in Proc. of ICASSP, 2013, pp. 6975–6979.
  • [18] J. B. Allen and D. A. Berkley, "Image method for efficiently simulating small-room acoustics," JASA, vol. 65, no. 4, pp. 943–950, 1979.
  • [19] M. Ravanelli, L. Cristoforetti, R. Gretter, M. Pellin, A. Sosi, and M. Omologo, "The DIRHA-ENGLISH corpus and related tasks for distant-speech recognition in domestic environments," in Proc. of ASRU, 2015.
  • [20] M. Ravanelli and M. Omologo, "Contaminated speech training methods for robust DNN-HMM distant speech recognition," in Proc. of Interspeech, 2015.
  • [21] D. S. Park, W. Chan, Y. Zhang, C.-C. Chiu, B. Zoph, E. D. Cubuk, and Q. V. Le, "SpecAugment: A simple data augmentation method for automatic speech recognition," in Proc. of Interspeech, 2019.
  • [22] M. Ravanelli and Y. Bengio, "Interpretable convolutional filters with SincNet," in Proc. of IRASL@NIPS, 2018.
  • [23] S. Ioffe and C. Szegedy, "Batch normalization: Accelerating deep network training by reducing internal covariate shift," in Proc. of ICML, 2015.
  • [24] K. He, X. Zhang, S. Ren, and J. Sun, "Delving deep into rectifiers: Surpassing human-level performance on ImageNet classification," in Proc. of ICCV, 2015.
  • [25] R. Schlüter, I. Bezrukov, H. Wagner, and H. Ney, "Gammatone features and feature combination for large vocabulary speech recognition," in Proc. of ICASSP, 2007.
  • [26] M. I. Belghazi, A. Baratin, S. Rajeshwar, S. Ozair, Y. Bengio, A. Courville, and D. Hjelm, "Mutual information neural estimation," in Proc. of ICML, 2018.
  • [27] D. Hjelm, A. Fedorov, S. Lavoie-Marchildon, K. Grewal, A. Trischler, and Y. Bengio, "Learning deep representations by mutual information estimation and maximization," in Proc. of ICLR, 2018.
  • [28] I. Albuquerque, J. Monteiro, T. Doan, B. Considine, T. Falk, and I. Mitliagkas, "Multi-objective training of generative adversarial networks with multiple discriminators," in Proc. of ICLR, 2019.
  • [29] D. P. Kingma and J. Ba, "Adam: A method for stochastic optimization," in Proc. of ICLR, 2015.
  • [30] R. Ge, S. M. Kakade, R. Kidambi, and P. Netrapalli, "Rethinking learning rate schedules for stochastic optimization," OpenReview, 2019.
  • [31] V. Panayotov, G. Chen, D. Povey, and S. Khudanpur, "Librispeech: An ASR corpus based on public domain audio books," in Proc. of ICASSP, 2015.
  • [32] J. S. Garofolo, L. F. Lamel, W. M. Fisher, J. G. Fiscus, D. S. Pallett, and N. L. Dahlgren, "DARPA TIMIT Acoustic Phonetic Continuous Speech Corpus CDROM," 1993.
  • [33] M. Ravanelli, P. Svaizer, and M. Omologo, "Realistic multi-microphone data simulation for distant speech recognition," in Proc. of Interspeech, 2016.
  • [34] J. Barker, S. Watanabe, E. Vincent, and J. Trmal, "The fifth 'CHiME' speech separation and recognition challenge: Dataset, task and baselines," in Proc. of Interspeech, 2018.
  • [35] M. Ravanelli, T. Parcollet, and Y. Bengio, "The PyTorch-Kaldi speech recognition toolkit," in Proc. of ICASSP, 2019.
  • [36] M. Ravanelli, P. Brakel, M. Omologo, and Y. Bengio, "Improving speech recognition by revising gated recurrent units," in Proc. of Interspeech, 2017.
  • [37] D. Povey et al., "The Kaldi speech recognition toolkit," in Proc. of ASRU, 2011.
  • [38] A. Waibel, T. Hanazawa, G. Hinton, K. Shikano, and K. J. Lang, "Phoneme recognition using time-delay neural networks," IEEE TASLP, vol. 37, no. 3, pp. 328–339, 1989.
  • [39] V. Peddinti, D. Povey, and S. Khudanpur, "A time delay neural network architecture for efficient modeling of long temporal contexts," in Proc. of Interspeech, 2015.