Pre-training Text Representations as Meta Learning
Abstract:
Pre-training text representations has recently been shown to significantly improve the state-of-the-art in many natural language processing tasks. The central goal of pre-training is to learn text representations that are useful for subsequent tasks. However, existing approaches are optimized by minimizing a proxy objective, such as the…
Introduction
- The primary goal of pre-training text representations is to acquire useful representations from data that can be effectively used for learning downstream NLP tasks.
- Existing approaches instead optimize proxy objectives such as predicting surrounding sentences (Kiros et al., 2015), discourse coherence (Jernite et al., 2017), etc.
- These objectives differ from the primary goal of pre-training and result in a mismatch between pre-training and fine-tuning.
- This paper explores how to alleviate the mismatch between the pre-training and fine-tuning processes.
- The learning process is akin to how humans build upon their prior experience and use it to quickly learn new concepts.
Highlights
- The primary goal of pre-training text representations is to acquire useful representations from data that can be effectively used for learning downstream NLP tasks
- Existing approaches instead optimize proxy objectives such as predicting surrounding sentences (Kiros et al., 2015), discourse coherence (Jernite et al., 2017), etc. These objectives differ from the primary goal of pre-training and result in a mismatch between pre-training and fine-tuning
- We show that there is an intrinsic connection between the multi-task objectives for pre-training and Model-Agnostic Meta-Learning (MAML) (Finn et al., 2017) with a sequence of meta-train steps
- We adopt question-answer pair matching and question-question pair matching as the pre-training tasks. k = 0 denotes that we adopt the official BERT-base for fine-tuning, while k ∈ {1, 3, 5, 10, 20} means we adopt a pre-trained model that performs k meta-train steps followed by one meta-test step during pre-training (a minimal sketch of this update follows this list)
- We introduce a learning algorithm which regards the pre-training of text representations as model-agnostic meta-learning
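To make the "k meta-train steps followed by one meta-test step" concrete, below is a minimal, self-contained PyTorch sketch. It is not the authors' implementation: a tiny linear model stands in for the BERT encoder, random tensors stand in for batches from the pre-training tasks, and all names and hyperparameters (forward, random_batch, inner_lr, etc.) are illustrative assumptions. It only shows the MAML-style structure: k inner gradient steps on meta-train batches, then one outer update on a meta-test batch that back-propagates through the inner steps.

```python
import torch
import torch.nn.functional as F

# Minimal sketch of the MAML-style pre-training update (not the authors' code).
# A tiny linear "encoder" stands in for BERT; random tensors stand in for
# batches drawn from the pre-training tasks.

torch.manual_seed(0)
in_dim, num_classes, k, inner_lr = 16, 2, 3, 1e-2

# Parameters being meta-learned (the "initialization" handed to fine-tuning).
weight = torch.randn(num_classes, in_dim, requires_grad=True)
bias = torch.zeros(num_classes, requires_grad=True)
outer_opt = torch.optim.Adam([weight, bias], lr=1e-3)

def forward(x, w, b):
    # Stand-in for the encoder plus a task-specific head.
    return F.linear(x, w, b)

def random_batch(n=8):
    # Stand-in for sampling a batch from a pre-training task.
    return torch.randn(n, in_dim), torch.randint(num_classes, (n,))

# --- k meta-train steps: plain gradient descent on the pre-training loss ----
fast_w, fast_b = weight, bias
for _ in range(k):
    x, y = random_batch()
    loss = F.cross_entropy(forward(x, fast_w, fast_b), y)
    gw, gb = torch.autograd.grad(loss, (fast_w, fast_b), create_graph=True)
    fast_w, fast_b = fast_w - inner_lr * gw, fast_b - inner_lr * gb

# --- one meta-test step: update the initialization through the inner steps --
x, y = random_batch()
meta_loss = F.cross_entropy(forward(x, fast_w, fast_b), y)
outer_opt.zero_grad()
meta_loss.backward()   # gradients flow back to `weight` and `bias`
outer_opt.step()
print(f"meta-test loss: {meta_loss.item():.4f}")
```

Back-propagating the meta-test loss through the inner updates (create_graph=True) is what distinguishes this from ordinary multi-task pre-training: the initialization is optimized for how well it adapts, not just for the pre-training loss itself.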
Results
- The authors adopt masked language modeling and sentence prediction as the pre-training tasks.
- The results in Table 2 show that the algorithm obtains better results than the official BERT once the model converges.
- The authors fine-tune the model on three datasets (SST-2, SST-5 and QDC) for 4 epochs; a rough sketch of this setup follows this list.
- The authors observe that pre-trained models with meta-train steps k ≥ 1 obtain better results than BERT at earlier epochs, which indicates that the learning algorithm learns a better initialization for downstream tasks.
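As a rough illustration of this fine-tuning setup, the sketch below fine-tunes a BERT-style checkpoint as a sentence classifier using the Hugging Face transformers library. The library choice, checkpoint name, toy sentences and learning rate are assumptions made for illustration rather than details taken from the paper; only the sentence-classification framing and the 4-epoch schedule come from the summary above. In practice the batch would come from a DataLoader over SST-2, SST-5 or QDC rather than two hard-coded sentences.

```python
import torch
from transformers import BertTokenizerFast, BertForSequenceClassification

# Illustrative fine-tuning sketch (not the authors' script).
# `model_name_or_path` would point either to the official BERT-base (k = 0)
# or to a checkpoint pre-trained with k meta-train steps.
model_name_or_path = "bert-base-uncased"
tokenizer = BertTokenizerFast.from_pretrained(model_name_or_path)
model = BertForSequenceClassification.from_pretrained(
    model_name_or_path, num_labels=2)  # 2 labels for SST-2, 5 for SST-5

# Toy SST-2-style examples standing in for a real DataLoader.
texts = ["a gripping , well-acted film .", "tedious and overlong ."]
labels = torch.tensor([1, 0])
batch = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")

optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)
model.train()
for epoch in range(4):                        # the summary reports 4 fine-tuning epochs
    outputs = model(**batch, labels=labels)   # returns loss and logits
    outputs.loss.backward()
    optimizer.step()
    optimizer.zero_grad()
    print(f"epoch {epoch}: loss = {outputs.loss.item():.4f}")
```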
Conclusion
- The authors introduce a learning algorithm which regards the pre-training of text representations as model-agnostic meta-learning.
Tables
- Table 1: Statistics of the fine-tuning datasets. The first part lists the unsupervised fine-tuning datasets and the second part the supervised fine-tuning datasets
- Table 2: Fine-tuning results on diverse downstream tasks
- Table 3: Fine-tuning results on SNLI with different pre-trained models
Related work
- Pre-training Text Representation Pre-trained text representations from unlabeled corpora have proven effective for many NLP tasks. Earlier works focus on learning embeddings for words (Mikolov et al., 2013; Pennington et al., 2014), the basic idea of which is to represent a word with its surrounding contexts. Recent studies show that pre-trained embeddings for longer pieces of text (e.g. a sentence, paragraph, or document) and contextualized word embeddings (Peters et al., 2018; Howard and Ruder, 2018; Radford et al., 2018; Devlin et al., 2019; Liu et al., 2019; Dong et al., 2019) are surprisingly useful, even driving the state of the art to human-level accuracy on challenging datasets like SQuAD (Rajpurkar et al., 2016) and SWAG (Zellers et al., 2018). Existing works in this direction typically optimize the pre-trained model using a certain task such as language modeling. However, a natural question is: why should we learn representations optimized for language modeling? The goal of pre-training text representations is not language modeling, but learning useful representations for downstream tasks. In this work, we directly optimize the pre-trained model towards this goal and leverage the successful meta-learning algorithm MAML.
Reference
- Samy Bengio, Yoshua Bengio, Jocelyn Cloutier, and Jan Gecsei. 1992. On the optimization of a synaptic learning rule. In Preprints Conf. Optimality in Artificial and Biological Neural Networks, pages 6–8. Univ. of Texas.
- Samuel R Bowman, Gabor Angeli, Christopher Potts, and Christopher D Manning. 2015a. A large annotated corpus for learning natural language inference. arXiv preprint arXiv:1508.05326.
- Samuel R. Bowman, Gabor Angeli, Christopher Potts, and Christopher D. Manning. 2015b. A large annotated corpus for learning natural language inference. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing (EMNLP). Association for Computational Linguistics.
- Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. Bert: Pre-training of deep bidirectional transformers for language understanding. Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics.
- Li Dong, Nan Yang, Wenhui Wang, Furu Wei, Xiaodong Liu, Yu Wang, Jianfeng Gao, Ming Zhou, and Hsiao-Wuen Hon. 2019. Unified language model pre-training for natural language understanding and generation. arXiv preprint arXiv:1905.03197.
- Chelsea Finn. 2018. Learning to Learn with Gradients. Ph.D. thesis, UC Berkeley.
- Chelsea Finn, Pieter Abbeel, and Sergey Levine. 2017. Model-agnostic meta-learning for fast adaptation of deep networks. In Proceedings of the 34th International Conference on Machine Learning, ICML 2017, Sydney, NSW, Australia, 6-11 August 2017, pages 1126–1135.
- Jiatao Gu, Yong Wang, Yun Chen, Victor O. K. Li, and Kyunghyun Cho. 2018. Meta-learning for low-resource neural machine translation. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 3622–3631. Association for Computational Linguistics.
- Dan Hendrycks and Kevin Gimpel. 2016. Bridging nonlinearities and stochastic regularizers with gaussian error linear units.
- Jeremy Howard and Sebastian Ruder. 2018. Universal language model fine-tuning for text classification. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 328–339. Association for Computational Linguistics.
- Po-Sen Huang, Chenglong Wang, Rishabh Singh, Wentau Yih, and Xiaodong He. 2018. Natural language to structured query generation via meta-learning. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 2 (Short Papers), pages 732–738. Association for Computational Linguistics.
- Yacine Jernite, Samuel R Bowman, and David Sontag. 2017. Discourse-based objectives for fast unsupervised sentence representation learning. arXiv preprint arXiv:1705.00557.
- Diederik P Kingma and Jimmy Ba. 2014. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980.
- Ryan Kiros, Yukun Zhu, Ruslan R Salakhutdinov, Richard Zemel, Raquel Urtasun, Antonio Torralba, and Sanja Fidler. 2015. Skip-thought vectors. In Advances in neural information processing systems, pages 3294–3302.
- Xiaodong Liu, Pengcheng He, Weizhu Chen, and Jianfeng Gao. 2019. Multi-task deep neural networks for natural language understanding. arXiv preprint arXiv:1901.11504.
- Bryan McCann, James Bradbury, Caiming Xiong, and Richard Socher. 2017. Learned in translation: Contextualized word vectors. In Advances in Neural Information Processing Systems, pages 6294–6305.
- Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S Corrado, and Jeff Dean. 2013. Distributed representations of words and phrases and their compositionality. In Advances in neural information processing systems, pages 3111–3119.
- Nikhil Mishra, Mostafa Rohaninejad, Xi Chen, and Pieter Abbeel. 2017. Meta-learning with temporal convolutions. arXiv preprint arXiv:1707.03141, 2(7).
- Nikita Nangia, Adina Williams, Angeliki Lazaridou, and Samuel R Bowman. 2017. The repeval 2017 shared task: Multi-genre natural language inference with sentence representations. arXiv preprint arXiv:1707.08172.
- Jeffrey Pennington, Richard Socher, and Christopher D. Manning. 2014. Glove: Global vectors for word representation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing, EMNLP 2014, October 25-29, 2014, Doha, Qatar, A meeting of SIGDAT, a Special Interest Group of the ACL, pages 1532–1543.
- Matthew Peters, Mark Neumann, Mohit Iyyer, Matt Gardner, Christopher Clark, Kenton Lee, and Luke Zettlemoyer. 2018. Deep contextualized word representations. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), pages 2227– 2237. Association for Computational Linguistics.
- Alec Radford, Karthik Narasimhan, Tim Salimans, and Ilya Sutskever. 2018. Improving language understanding by generative pre-training. Tech Report.
- Pranav Rajpurkar, Jian Zhang, Konstantin Lopyrev, and Percy Liang. 2016. Squad: 100,000+ questions for machine comprehension of text. arXiv preprint arXiv:1606.05250.
- Sachin Ravi and Hugo Larochelle. 2017. Optimization as a model for few-shot learning. ICLR 2017.
- Adam Santoro, Sergey Bartunov, Matthew Botvinick, Daan Wierstra, and Timothy Lillicrap. 2016. Metalearning with memory-augmented neural networks. In International conference on machine learning, pages 1842–1850.
- Jürgen Schmidhuber. 1987. Evolutionary principles in self-referential learning, or on learning how to learn: the meta-meta-... hook. Ph.D. thesis, Technische Universität München.
- Jake Snell, Kevin Swersky, and Richard Zemel. 2017. Prototypical networks for few-shot learning. In I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett, editors, Advances in Neural Information Processing Systems 30, pages 4077–4087. Curran Associates, Inc.
- Richard Socher, Alex Perelygin, Jean Wu, Jason Chuang, Christopher D Manning, Andrew Ng, and Christopher Potts. 2013. Recursive deep models for semantic compositionality over a sentiment treebank. In Proceedings of the 2013 conference on empirical methods in natural language processing, pages 1631–1642.
- Wilson L Taylor. 1953. Cloze procedure: A new tool for measuring readability. Journalism Bulletin, 30(4):415–433.
- Oriol Vinyals, Charles Blundell, Timothy Lillicrap, Daan Wierstra, et al. 2016. Matching networks for one shot learning. In Advances in neural information processing systems, pages 3630–3638.
- Yonghui Wu, Mike Schuster, Zhifeng Chen, Quoc V Le, Mohammad Norouzi, Wolfgang Macherey, Maxim Krikun, Yuan Cao, Qin Gao, Klaus Macherey, et al. 2016. Google’s neural machine translation system: Bridging the gap between human and machine translation. arXiv preprint arXiv:1609.08144.
- Qizhe Xie, Guokun Lai, Zihang Dai, and Eduard Hovy. 2017. Large-scale cloze test dataset created by teachers. arXiv preprint arXiv:1711.03225.
- Rowan Zellers, Yonatan Bisk, Roy Schwartz, and Yejin Choi. 2018. Swag: A large-scale adversarial dataset for grounded commonsense inference. arXiv preprint arXiv:1808.05326.