Multi-Task Self-Supervised Learning for Disfluency Detection

AAAI Conference on Artificial Intelligence, 2020.

Keywords:
conditional random fields, automatic speech recognition, data bottleneck, previous method, competitive performance

Abstract:

Most existing approaches to disfluency detection heavily rely on human-annotated data, which is expensive to obtain in practice. To tackle the training data bottleneck, we investigate methods for combining multiple self-supervised tasks, i.e., supervised tasks where data can be collected without manual labeling. First, we construct large…

Introduction
  • Automatic speech recognition (ASR) outputs often contain various disfluencies, which create barriers to subsequent text processing tasks like parsing, machine translation, and summarization.
  • Disfluency detection (Zayats et al., 2016; Wang et al., 2016; Wu et al., 2015) focuses on recognizing these disfluencies in ASR outputs.
  • As shown in Figure 1, a standard annotation of the disfluency structure marks the reparandum, the interruption point, an optional interregnum, and the associated repair (Shriberg, 1994); a small illustrative example follows this list.
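To make the reparandum / interregnum / repair terminology concrete, here is a small, generic example of the kind of annotation Figure 1 refers to. The sentence, the RM/IM tag names, and the removal step below are purely illustrative assumptions, not the paper's data or labeling scheme.

```python
# Illustrative disfluency annotation (generic example, not from the paper's corpus):
#   "I want a flight to Boston uh I mean to Denver"
#   reparandum = "to Boston", interregnum = "uh I mean", repair = "to Denver"
tokens = ["I", "want", "a", "flight", "to", "Boston",
          "uh", "I", "mean", "to", "Denver"]

# Disfluency detection is commonly cast as sequence labeling: mark the
# reparandum (and interregnum) tokens so they can be removed.
labels = ["O", "O", "O", "O", "RM", "RM",   # reparandum tokens
          "IM", "IM", "IM",                 # interregnum tokens
          "O", "O"]                         # repair tokens are kept

fluent = [t for t, l in zip(tokens, labels) if l == "O"]
print(" ".join(fluent))  # -> "I want a flight to Denver"
```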
Highlights
  • Automatic speech recognition (ASR) outputs often contain various disfluencies, which create barriers to subsequent text processing tasks like parsing, machine translation, and summarization
  • Motivated by the success of self-supervised learning, we propose two self-supervised tasks for the disfluency detection task, as shown in Figure 2 (a data-construction sketch follows this list)
  • We propose two self-supervised tasks for disfluency detection to tackle the training data bottleneck
  • Experimental results on the commonly used English Switchboard test set show that our approach can achieve competitive performance compared to the previous systems by using less than 1% (1000 sentences) of the training data
  • Our method trained on the full dataset significantly outperforms previous methods, reducing the error by 21% on English Switchboard
  • We propose two self-supervised tasks to tackle the training data bottleneck
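A minimal sketch of the kind of self-supervised setup the highlights describe: build pseudo training data from unlabeled text and derive (i) a tagging task over added words and (ii) a sentence classification task, with no human labels. The corruption scheme (randomly inserting and deleting words) and all function names here are assumptions for illustration, not the paper's exact procedure.

```python
import random

def corrupt(clean_tokens, n_insert=1):
    """Build a pseudo-disfluent sentence from a clean one by randomly
    inserting (repeating) and deleting words. Returns the noisy tokens
    and per-token labels for the tagging task (1 = added word).
    Illustrative corruption scheme; the paper's details may differ."""
    tokens, labels = list(clean_tokens), [0] * len(clean_tokens)
    for _ in range(n_insert):
        i = random.randrange(len(tokens))
        tokens.insert(i, tokens[i])   # inject a repeated word as noise
        labels.insert(i, 1)
    if len(tokens) > 2 and random.random() < 0.5:
        j = random.randrange(len(tokens))
        del tokens[j], labels[j]      # occasionally drop a word as well
    return tokens, labels

clean = "i want a flight to denver".split()
noisy, tag_labels = corrupt(clean)

# Task 1 (tagging): predict tag_labels, i.e. which words were added.
# Task 2 (sentence classification): tell original sentences (label 1)
# apart from corrupted ones (label 0). Both tasks are used to pre-train
# an encoder that is later fine-tuned on gold disfluency detection data.
classification_data = [(clean, 1), (noisy, 0)]
```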
Methods
  • The two baseline methods are a Bi-LSTM model (Zayats et al., 2016) and a transition-based model (Wang et al., 2017).
  • Repetition reparandums are relatively easy to detect, while other types of reparandums, such as repairs, are more complex (Zayats et al., 2016; Ostendorf and Hahn, 2013).
  • To better understand model performance, the authors evaluate each model's ability to detect repetition vs. non-repetition reparandums (see the sketch after this list).
  • All three models achieve high scores on repetition reparandums.
  • The authors' self-supervised model is much better at predicting non-repetitions than the two baseline methods.
  • The authors conjecture that the self-supervised tasks capture more sentence-level structural information.
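A sketch of the repetition vs. non-repetition breakdown mentioned above: gold reparandum spans are split by whether the same words reappear right after the span, and each subset is scored separately. The span representation and the heuristic are assumptions for illustration, not the paper's exact evaluation code.

```python
def is_repetition(tokens, span):
    """Treat a reparandum span [start, end) as a repetition if the same
    words occur again immediately after it (rough illustrative heuristic)."""
    start, end = span
    width = end - start
    return tokens[end:end + width] == tokens[start:end]

def split_by_type(examples):
    """examples: list of (tokens, gold_reparandum_spans).
    Returns (repetition_cases, non_repetition_cases) for separate scoring."""
    rep, nonrep = [], []
    for tokens, spans in examples:
        for span in spans:
            (rep if is_repetition(tokens, span) else nonrep).append((tokens, span))
    return rep, nonrep

# Example: "i i want to go to boston to denver" -> "i" is a repetition
# reparandum, "to boston" is a non-repetition (correction) reparandum.
examples = [("i i want to go to boston to denver".split(), [(0, 1), (5, 7)])]
rep, nonrep = split_by_type(examples)
print(len(rep), len(nonrep))  # -> 1 1
```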
Results
  • Experimental results on the commonly used English Switchboard test set show that the approach can achieve competitive performance compared to the previous systems by using less than 1% (1000 sentences) of the training data.
  • The authors' self-supervised method achieves an almost 20-point improvement over the transition-based method when using less than 1% (1000 sentences) of the human-annotated disfluency detection data.
  • The authors' method trained on the full dataset significantly outperforms previous methods, reducing the error by 21% on English Switchboard; the relative error-reduction arithmetic is sketched below.
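The 21% figure is a relative error reduction rather than an absolute F-score gain. A minimal sketch of the arithmetic, using error = 100 - F1; the scores passed in below are hypothetical placeholders, not the paper's reported numbers.

```python
def relative_error_reduction(f1_prev, f1_new):
    """Share of the previous system's error (100 - F1) removed by the new one."""
    err_prev, err_new = 100.0 - f1_prev, 100.0 - f1_new
    return 100.0 * (err_prev - err_new) / err_prev

# Hypothetical scores, purely to show the formula:
print(round(relative_error_reduction(90.0, 92.1), 1))  # -> 21.0
```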
Conclusion
  • The authors propose two self-supervised tasks to tackle the training data bottleneck.
  • Experimental results on the commonly used English Switchboard test set show that the approach can achieve competitive performance compared to the previous systems by using less than 1% (1000 sentences) of the training data.
  • The authors' method trained on the full dataset significantly outperforms previous methods, reducing the error by 21% on English Switchboard
Tables
  • Table1: Different types of disfluencies
  • Table2: Experiment results on English Switchboard data, where “Full” means the results using 100% of the human-annotated data, and “1000 sents” means the results using less than 1% (1000 sentences) of the human-annotated data
  • Table3: Comparison with previous state-of-the-art methods on the test set of English Switchboard. “Full” means using 100% human-annotated data for fine-tuning, and “1000 sents” means using less than 1% (1000 sentences) human-annotated data for fine-tuning
  • Table4: Results of feature ablation experiments on English Switchboard test data. “random-initial” means training the transformer network on gold disfluency detection data with random initialization
  • Table5: Comparison with BERT. “random-initial” means training transformer network on gold disfluency detection data with random initialization. “combine” means concatenating hidden representations of BERT and our self-supervised models for fine-tuning
  • Table6: F-score of different types of reparandums on English Switchboard test data
References
  • Pulkit Agrawal, Joao Carreira, and Jitendra Malik. 2015. Learning to see by moving. In Proceedings of the IEEE International Conference on Computer Vision, pages 37–45.
  • Yoshua Bengio, Rejean Ducharme, Pascal Vincent, and Christian Jauvin. 2003. A neural probabilistic language model. Journal of machine learning research, 3(Feb):1137–1155.
  • Joachim Bingel and Anders Søgaard. 2017. Identifying beneficial task relations for multi-task learning in deep neural networks. In Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 2, Short Papers, pages 164–169, Valencia, Spain. Association for Computational Linguistics.
  • Eugene Charniak and Mark Johnson. 2001. Edit detection and parsing for transcribed speech. In Proceedings of the second meeting of the North American Chapter of the Association for Computational Linguistics on Language technologies, pages 1–9. Association for Computational Linguistics.
  • Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies.
  • Carl Doersch, Abhinav Gupta, and Alexei A Efros. 2015. Unsupervised visual representation learning by context prediction. In Proceedings of the IEEE International Conference on Computer Vision, pages 1422–1430.
  • James Ferguson, Greg Durrett, and Dan Klein. 2015. Disfluency detection with a semi-markov model and prosodic features. In Proceedings of the 2015 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 257–262. Association for Computational Linguistics.
  • Basura Fernando, Hakan Bilen, Efstratios Gavves, and Stephen Gould. 2017. Self-supervised video representation learning with odd-one-out networks. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 3636–3645.
  • Kallirroi Georgila. 2009. Using integer linear programming for detecting speech disfluencies. In Proceedings of Human Language Technologies: The 2009 Annual Conference of the North American Chapter of the Association for Computational Linguistics, Companion Volume: Short Papers, pages 109–112. Association for Computational Linguistics.
  • John J Godfrey, Edward C Holliman, and Jane McDaniel. 1992. Switchboard: Telephone speech corpus for research and development. In ICASSP, pages 517–520. IEEE.
  • Dan Hendrycks and Kevin Gimpel. 2016. Bridging nonlinearities and stochastic regularizers with Gaussian error linear units. arXiv preprint arXiv:1606.08415.
  • Matthew Honnibal and Mark Johnson. 2014. Joint incremental disfluency detection and dependency parsing. Transactions of the Association for Computational Linguistics, 2:131–142.
  • Julian Hough and David Schlangen. 2015. Recurrent neural networks for incremental disfluency detection. In Sixteenth Annual Conference of the International Speech Communication Association.
  • Paria Jamshid Lou, Peter Anderson, and Mark Johnson. 2018. Disfluency detection using auto-correlational neural networks. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 4610–4619, Brussels, Belgium. Association for Computational Linguistics.
  • Mark Johnson and Eugene Charniak. 2004. A tagbased noisy channel model of speech repairs. In Proceedings of the 42nd Annual Meeting on Association for Computational Linguistics, page 33. Association for Computational Linguistics.
  • Diederik P. Kingma and Jimmy Ba. 2015. Adam: A method for stochastic optimization. CoRR, abs/1412.6980.
  • Paria Jamshid Lou and Mark Johnson. 2017. Disfluency detection using a noisy channel model and a deep neural language model. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers).
  • Héctor Martínez Alonso and Barbara Plank. 2017. When is multitask learning effective? Semantic sequence prediction under varying data conditions. In Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 1, Long Papers, pages 44–53, Valencia, Spain. Association for Computational Linguistics.
  • Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean. 2013a. Efficient estimation of word representations in vector space. CoRR.
  • Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S Corrado, and Jeff Dean. 2013b. Distributed representations of words and phrases and their compositionality. In Advances in neural information processing systems, pages 3111–3119.
  • Mari Ostendorf and Sangyun Hahn. 2013. A sequential repetition model for improved disfluency detection. In INTERSPEECH, pages 2624–2628.
  • Hao Peng, Sam Thomson, and Noah A. Smith. 2017. Deep multitask learning for semantic dependency parsing. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 2037–2048, Vancouver, Canada. Association for Computational Linguistics.
  • Matthew Peters, Mark Neumann, Mohit Iyyer, Matt Gardner, Christopher Clark, Kenton Lee, and Luke Zettlemoyer. 2018. Deep contextualized word representations. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), pages 2227–2237, New Orleans, Louisiana. Association for Computational Linguistics.
  • Xian Qian and Yang Liu. 2013. Disfluency detection using multi-step stacked learning. In HLT-NAACL, pages 820–825.
  • Alec Radford, Karthik Narasimhan, Tim Salimans, and Ilya Sutskever. 2018. Improving language understanding with unsupervised learning. Technical report, OpenAI.
  • Mohammad Sadegh Rasooli and Joel R Tetreault. 2013. Joint parsing and disfluency detection in linear time. In EMNLP, pages 124–129.
  • Elizabeth Ellen Shriberg. 1994. Preliminaries to a theory of speech disfluencies. Ph.D. thesis, Citeseer.
  • Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Advances in neural information processing systems, pages 5998–6008.
  • Feng Wang, Wei Chen, Zhen Yang, Qianqian Dong, Shuang Xu, and Bo Xu. 2018. Semi-supervised disfluency detection. In Proceedings of the 27th International Conference on Computational Linguistics, pages 3529–3538. Association for Computational Linguistics.
  • Shaolei Wang, Wanxiang Che, and Ting Liu. 2016. A neural attention model for disfluency detection. In Proceedings of COLING 2016, the 26th International Conference on Computational Linguistics: Technical Papers, pages 278–287, Osaka, Japan. The COLING 2016 Organizing Committee.
  • Shaolei Wang, Wanxiang Che, Yue Zhang, Meishan Zhang, and Ting Liu. 2017. Transition-based disfluency detection using LSTMs. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pages 2785–2794.
  • Xiaolong Wang and Abhinav Gupta. 2015. Unsupervised learning of visual representations using videos. In Proceedings of the IEEE International Conference on Computer Vision, pages 2794–2802.
  • Shuangzhi Wu, Dongdong Zhang, Ming Zhou, and Tiejun Zhao. 2015. Efficient disfluency detection with transition-based parsing. In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pages 495–503. Association for Computational Linguistics.
  • Masashi Yoshikawa, Hiroyuki Shindo, and Yuji Matsumoto. 2016. Joint transition-based dependency parsing and disfluency detection for automatic speech recognition texts. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, pages 1036–1041.
  • Vicky Zayats and Mari Ostendorf. 2018. Robust crossdomain disfluency detection with pattern match networks. arXiv preprint arXiv:1811.07236.
  • Vicky Zayats and Mari Ostendorf. 2019. Giving attention to the unexpected: Using prosody innovations in disfluency detection. arXiv preprint arXiv:1904.04388.
  • Vicky Zayats, Mari Ostendorf, and Hannaneh Hajishirzi. 2016. Disfluency detection using a bidirectional LSTM. arXiv preprint arXiv:1604.03209.
  • Victoria Zayats, Mari Ostendorf, and Hannaneh Hajishirzi. 2014. Multi-domain disfluency and repair detection. In Fifteenth Annual Conference of the International Speech Communication Association.
  • Simon Zwarts, Mark Johnson, and Robert Dale. 2010. Detecting speech repairs incrementally using a noisy channel approach. In Proceedings of the 23rd international conference on computational linguistics, pages 1371–1378. Association for Computational Linguistics.