Pre-training Text Representations as Meta Learning

Keywords: language modeling, proxy task, pre-training text representation, model-agnostic meta-learning

Abstract:

Pre-training text representations has recently been shown to significantly improve the state-of-the-art in many natural language processing tasks. The central goal of pre-training is to learn text representations that are useful for subsequent tasks. However, existing approaches are optimized by minimizing a proxy objective, such as the ...

Introduction
  • The primary goal of pre-training text representations is to acquire useful representations from data that can be effectively used for learning downstream NLP tasks.
  • However, existing approaches optimize proxy objectives such as language modeling, predicting surrounding sentences (Kiros et al., 2015), and discourse coherence (Jernite et al., 2017).
  • These objectives differ from the primary goal of pre-training and result in a mismatch between pre-training and fine-tuning.
  • This paper explores how to alleviate the mismatch between the pre-training and fine-tuning processes.
  • The learning process is akin to how humans build upon prior experience and use it to quickly learn new concepts
Highlights
  • The primary goal of pre-training text representations is to acquire useful representations from data that can be effectively used for learning downstream NLP tasks
  • However, existing approaches optimize proxy objectives such as language modeling, predicting surrounding sentences (Kiros et al., 2015), and discourse coherence (Jernite et al., 2017). These objectives differ from the primary goal of pre-training and result in a mismatch between pre-training and fine-tuning
  • We show that there is an intrinsic connection between the multi-task objectives for pre-training and Model-Agnostic Meta-Learning (MAML) (Finn et al., 2017) with a sequence of meta-train steps
  • We adopt question-answer pair matching and question-question pair matching as the pre-training tasks. k = 0 denotes fine-tuning the official BERT-base model, while k ∈ {1, 3, 5, 10, 20} denotes fine-tuning a model pre-trained with k meta-train steps followed by one meta-test step (see the sketch after this list)
  • We introduce a learning algorithm which regards the pre-training of text representations as model-agnostic meta-learning
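The schedule referenced above (k meta-train steps followed by one meta-test step) can be illustrated with a minimal sketch. This is not the authors' code: the toy encoder stands in for BERT, sample_batch() is a hypothetical placeholder for batches drawn from the question-answer and question-question matching tasks, and the learning rates and the first-order update are assumptions rather than the paper's exact procedure.

```python
# Minimal sketch of k meta-train steps + 1 meta-test step (not the authors' code).
import copy
import torch
import torch.nn as nn

encoder = nn.Sequential(nn.Linear(32, 64), nn.ReLU(), nn.Linear(64, 2))  # toy stand-in for BERT
loss_fn = nn.CrossEntropyLoss()
alpha, beta, k = 1e-3, 1e-3, 5  # inner lr, outer lr, number of meta-train steps (assumed values)

def sample_batch():
    """Hypothetical sampler standing in for the real pre-training tasks
    (question-answer / question-question pair matching)."""
    return torch.randn(16, 32), torch.randint(0, 2, (16,))

meta_optimizer = torch.optim.Adam(encoder.parameters(), lr=beta)

for step in range(100):
    # k meta-train steps: adapt a temporary copy of the model (inner loop).
    fast_model = copy.deepcopy(encoder)
    inner_optimizer = torch.optim.SGD(fast_model.parameters(), lr=alpha)
    for _ in range(k):
        x, y = sample_batch()
        inner_optimizer.zero_grad()
        loss_fn(fast_model(x), y).backward()
        inner_optimizer.step()

    # One meta-test step: evaluate the adapted weights on a fresh batch and
    # push the original parameters toward weights that adapt well.
    # First-order approximation: the adapted model's gradients are applied
    # directly to the original parameters (as in first-order MAML / Reptile).
    x, y = sample_batch()
    meta_loss = loss_fn(fast_model(x), y)
    fast_grads = torch.autograd.grad(meta_loss, list(fast_model.parameters()))
    meta_optimizer.zero_grad()
    for p, g in zip(encoder.parameters(), fast_grads):
        p.grad = g.clone()
    meta_optimizer.step()
```

The inner loop adapts a copy of the parameters on the pre-training tasks; the meta-test step then updates the original parameters so that they adapt well, which is the sense in which pre-training is treated as meta-learning.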
Results
  • The authors adopt masked language modeling and sentence prediction as the pre-training tasks.
  • The results in Table 2 show that the algorithm obtains better results than the official BERT model once the model converges.
  • The authors fine-tune the model on three datasets (SST-2, SST-5, and QDC) for 4 epochs.
  • The authors observe that pre-trained models with meta-train steps k ≥ 1 obtain better results than BERT at earlier epochs, which indicates that the learning algorithm learns a better initialization for downstream tasks (a toy sketch of this epoch-wise comparison follows below)
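A toy sketch of the epoch-wise comparison described above, under stated assumptions: a linear classifier and synthetic data stand in for the fine-tuned BERT models and the SST-2/SST-5/QDC datasets, and the two initializations are random placeholders for the official BERT checkpoint (k = 0) and a meta-pre-trained checkpoint (k ≥ 1).

```python
# Fine-tune each initialization for 4 epochs and record dev accuracy per epoch,
# so early-epoch behaviour of different initializations can be compared.
import torch
import torch.nn as nn
import torch.nn.functional as F

def make_data(n=256, d=32):
    x = torch.randn(n, d)
    return x, (x.sum(dim=1) > 0).long()  # synthetic binary labels

train_x, train_y = make_data()
dev_x, dev_y = make_data()

def fine_tune(init_state, epochs=4, lr=1e-2):
    """Fine-tune from a given initialization; return dev accuracy after each epoch."""
    model = nn.Linear(32, 2)
    model.load_state_dict(init_state)
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    accs = []
    for _ in range(epochs):
        opt.zero_grad()
        F.cross_entropy(model(train_x), train_y).backward()
        opt.step()
        with torch.no_grad():
            accs.append((model(dev_x).argmax(dim=1) == dev_y).float().mean().item())
    return accs

# Hypothetical initializations: "bert_init" plays the role of official BERT,
# "meta_init" the role of a checkpoint pre-trained with k >= 1 meta-train steps.
bert_init = nn.Linear(32, 2).state_dict()
meta_init = nn.Linear(32, 2).state_dict()
for name, init in [("BERT (k=0)", bert_init), ("meta-pretrained (k>=1)", meta_init)]:
    print(name, fine_tune(init))
```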
Conclusion
  • The authors introduce a learning algorithm which regards the pre-training of text representations as model-agnostic meta-learning.
Tables
  • Table 1: Statistics of the fine-tuning datasets. The first part lists the unsupervised fine-tuning datasets and the second part the supervised fine-tuning datasets
  • Table 2: Fine-tuning results on diverse downstream tasks
  • Table 3: Fine-tuning results on SNLI with different pre-trained models
Related work
  • Pre-training Text Representation Pre-trained text representations from unlabeled corpora have proven effective for many NLP tasks. Earlier works focus on learning embeddings for words (Mikolov et al., 2013; Pennington et al., 2014), the basic idea of which is to represent a word by its surrounding contexts. Recent studies show that pre-trained embeddings for longer pieces of text (e.g., a sentence, paragraph, or document) and contextualized word embeddings (Peters et al., 2018; Howard and Ruder, 2018; Radford et al., 2018; Devlin et al., 2019; Liu et al., 2019; Dong et al., 2019) are surprisingly useful, even driving the state of the art to human-level accuracy on challenging datasets like SQuAD (Rajpurkar et al., 2016) and SWAG (Zellers et al., 2018). Existing works in this direction typically optimize the pre-trained model with a proxy task such as language modeling. However, a natural question is: why should we learn representations optimized for language modeling? The goal of pre-training text representations is not language modeling but learning representations that are useful for downstream tasks. In this work, we directly optimize the pre-trained model towards this goal and leverage the successful meta-learning algorithm MAML (its meta-objective is recalled below).
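For reference, the MAML meta-objective from Finn et al. (2017) that this work builds on can be written as below; the notation (task distribution p(τ), inner step size α, per-task loss L_τ) is generic and not necessarily the paper's exact formulation.

```latex
% MAML (Finn et al., 2017): adapt the initialization \theta to each task \tau_i
% with an inner gradient step of size \alpha, then choose \theta so that the
% adapted parameters \theta_i' perform well on held-out data from the same task.
\theta_i' = \theta - \alpha \,\nabla_\theta \mathcal{L}_{\tau_i}(f_\theta), \qquad
\min_\theta \sum_{\tau_i \sim p(\tau)} \mathcal{L}_{\tau_i}\!\left(f_{\theta_i'}\right)
 = \min_\theta \sum_{\tau_i \sim p(\tau)} \mathcal{L}_{\tau_i}\!\left(f_{\theta - \alpha \nabla_\theta \mathcal{L}_{\tau_i}(f_\theta)}\right)
```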
Reference
  • Samy Bengio, Yoshua Bengio, Jocelyn Cloutier, and Jan Gecsei. 1992. On the optimization of a synaptic learning rule. In Preprints Conf. Optimality in Artificial and Biological Neural Networks, pages 6–8. Univ. of Texas.
  • Samuel R. Bowman, Gabor Angeli, Christopher Potts, and Christopher D. Manning. 2015a. A large annotated corpus for learning natural language inference. arXiv preprint arXiv:1508.05326.
  • Samuel R. Bowman, Gabor Angeli, Christopher Potts, and Christopher D. Manning. 2015b. A large annotated corpus for learning natural language inference. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing (EMNLP). Association for Computational Linguistics.
  • Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics.
  • Li Dong, Nan Yang, Wenhui Wang, Furu Wei, Xiaodong Liu, Yu Wang, Jianfeng Gao, Ming Zhou, and Hsiao-Wuen Hon. 2019. Unified language model pre-training for natural language understanding and generation. arXiv preprint arXiv:1905.03197.
  • Chelsea Finn. 2018. Learning to Learn with Gradients. Ph.D. thesis, UC Berkeley.
  • Chelsea Finn, Pieter Abbeel, and Sergey Levine. 2017. Model-agnostic meta-learning for fast adaptation of deep networks. In Proceedings of the 34th International Conference on Machine Learning, ICML 2017, Sydney, NSW, Australia, 6-11 August 2017, pages 1126–1135.
  • Jiatao Gu, Yong Wang, Yun Chen, Victor O. K. Li, and Kyunghyun Cho. 2018. Meta-learning for low-resource neural machine translation. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 3622–3631. Association for Computational Linguistics.
  • Dan Hendrycks and Kevin Gimpel. 2016. Bridging nonlinearities and stochastic regularizers with Gaussian error linear units.
  • Jeremy Howard and Sebastian Ruder. 2018. Universal language model fine-tuning for text classification. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 328–339. Association for Computational Linguistics.
  • Po-Sen Huang, Chenglong Wang, Rishabh Singh, Wen-tau Yih, and Xiaodong He. 2018. Natural language to structured query generation via meta-learning. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 2 (Short Papers), pages 732–738. Association for Computational Linguistics.
  • Yacine Jernite, Samuel R. Bowman, and David Sontag. 2017. Discourse-based objectives for fast unsupervised sentence representation learning. arXiv preprint arXiv:1705.00557.
  • Diederik P. Kingma and Jimmy Ba. 2014. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980.
  • Ryan Kiros, Yukun Zhu, Ruslan R. Salakhutdinov, Richard Zemel, Raquel Urtasun, Antonio Torralba, and Sanja Fidler. 2015. Skip-thought vectors. In Advances in Neural Information Processing Systems, pages 3294–3302.
  • Xiaodong Liu, Pengcheng He, Weizhu Chen, and Jianfeng Gao. 2019. Multi-task deep neural networks for natural language understanding. arXiv preprint arXiv:1901.11504.
  • Bryan McCann, James Bradbury, Caiming Xiong, and Richard Socher. 2017. Learned in translation: Contextualized word vectors. In Advances in Neural Information Processing Systems, pages 6294–6305.
  • Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S. Corrado, and Jeff Dean. 2013. Distributed representations of words and phrases and their compositionality. In Advances in Neural Information Processing Systems, pages 3111–3119.
  • Nikhil Mishra, Mostafa Rohaninejad, Xi Chen, and Pieter Abbeel. 2017. Meta-learning with temporal convolutions. arXiv preprint arXiv:1707.03141, 2(7).
  • Nikita Nangia, Adina Williams, Angeliki Lazaridou, and Samuel R. Bowman. 2017. The RepEval 2017 shared task: Multi-genre natural language inference with sentence representations. arXiv preprint arXiv:1707.08172.
  • Jeffrey Pennington, Richard Socher, and Christopher D. Manning. 2014. GloVe: Global vectors for word representation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 1532–1543.
  • Matthew Peters, Mark Neumann, Mohit Iyyer, Matt Gardner, Christopher Clark, Kenton Lee, and Luke Zettlemoyer. 2018. Deep contextualized word representations. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), pages 2227–2237. Association for Computational Linguistics.
  • Alec Radford, Karthik Narasimhan, Tim Salimans, and Ilya Sutskever. 2018. Improving language understanding by generative pre-training. Technical report, OpenAI.
  • Pranav Rajpurkar, Jian Zhang, Konstantin Lopyrev, and Percy Liang. 2016. SQuAD: 100,000+ questions for machine comprehension of text. arXiv preprint arXiv:1606.05250.
  • Sachin Ravi and Hugo Larochelle. 2017. Optimization as a model for few-shot learning. In International Conference on Learning Representations (ICLR).
  • Adam Santoro, Sergey Bartunov, Matthew Botvinick, Daan Wierstra, and Timothy Lillicrap. 2016. Meta-learning with memory-augmented neural networks. In International Conference on Machine Learning, pages 1842–1850.
  • Jürgen Schmidhuber. 1987. Evolutionary principles in self-referential learning, or on learning how to learn: the meta-meta-... hook. Ph.D. thesis, Technische Universität München.
  • Jake Snell, Kevin Swersky, and Richard Zemel. 2017. Prototypical networks for few-shot learning. In Advances in Neural Information Processing Systems 30, pages 4077–4087. Curran Associates, Inc.
  • Richard Socher, Alex Perelygin, Jean Wu, Jason Chuang, Christopher D. Manning, Andrew Ng, and Christopher Potts. 2013. Recursive deep models for semantic compositionality over a sentiment treebank. In Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing, pages 1631–1642.
  • Wilson L. Taylor. 1953. Cloze procedure: A new tool for measuring readability. Journalism Bulletin, 30(4):415–433.
  • Oriol Vinyals, Charles Blundell, Timothy Lillicrap, Daan Wierstra, et al. 2016. Matching networks for one shot learning. In Advances in Neural Information Processing Systems, pages 3630–3638.
  • Yonghui Wu, Mike Schuster, Zhifeng Chen, Quoc V. Le, Mohammad Norouzi, Wolfgang Macherey, Maxim Krikun, Yuan Cao, Qin Gao, Klaus Macherey, et al. 2016. Google's neural machine translation system: Bridging the gap between human and machine translation. arXiv preprint arXiv:1609.08144.
  • Qizhe Xie, Guokun Lai, Zihang Dai, and Eduard Hovy. 2017. Large-scale cloze test dataset created by teachers. arXiv preprint arXiv:1711.03225.
  • Rowan Zellers, Yonatan Bisk, Roy Schwartz, and Yejin Choi. 2018. SWAG: A large-scale adversarial dataset for grounded commonsense inference. arXiv preprint arXiv:1808.05326.