XLNet: Generalized Autoregressive Pretraining for Language Understanding

    NeurIPS, pp. 5754-5764, 2019.

    Keywords:
sentiment analysis, document ranking

    Abstract:

    With the capability of modeling bidirectional contexts, denoising autoencoding based pretraining like BERT achieves better performance than pretraining approaches based on autoregressive language modeling. However, relying on corrupting the input with masks, BERT neglects dependency between the masked positions and suffers from a pretrain-finetune discrepancy. In light of these pros and cons, we propose XLNet, a generalized autoregressive pretraining method that (1) enables learning bidirectional contexts by maximizing the expected likelihood over all permutations of the factorization order and (2) overcomes the limitations of BERT thanks to its autoregressive formulation. Furthermore, XLNet integrates ideas from Transformer-XL, the state-of-the-art autoregressive model, into pretraining. Empirically, under comparable experiment settings, XLNet outperforms BERT on 20 tasks, often by a large margin, including question answering, natural language inference, sentiment analysis, and document ranking.

    Introduction
    • Unsupervised representation learning has been highly successful in the domain of natural language processing [7, 22, 27, 28, 10]
    • These methods first pretrain neural networks on large-scale unlabeled text corpora, and finetune the models or representations on downstream tasks.
    • Downstream language understanding tasks often require bidirectional context information
    • Since an AR model is only trained to encode a unidirectional context, this results in a gap between AR language modeling and effective pretraining
    Highlights
    • Unsupervised representation learning has been highly successful in the domain of natural language processing [7, 22, 27, 28, 10]
    • Faced with the pros and cons of existing language pretraining objectives, in this work, we propose XLNet, a generalized autoregressive method that leverages the best of both AR language modeling and AE while avoiding their limitations
    • Borrowing ideas from orderless NADE [32], we propose the permutation language modeling objective, which retains the benefits of AR models while allowing the model to capture bidirectional contexts (see the sketch after this list)
    • Relative segment encodings: architecturally, unlike BERT, which adds an absolute segment embedding to the word embedding at each position, we extend the idea of relative encodings from Transformer-XL to encode the segments
    • XLNet is a generalized AR pretraining method that uses a permutation language modeling objective to combine the advantages of AR and AE methods
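    To make the permutation language modeling idea above concrete, the following is a minimal sketch (not the authors' implementation; the function name, the use of NumPy, and the toy sequence length are assumptions) of how a sampled factorization order can be turned into a visibility mask, so that each position is predicted only from the positions that precede it in the sampled order:

```python
import numpy as np

def permutation_visibility_mask(seq_len: int, rng: np.random.Generator) -> np.ndarray:
    """Sample a factorization order z over positions 0..seq_len-1 and return a
    boolean (seq_len x seq_len) matrix where mask[i, j] is True iff position j
    may be used as context when predicting position i, i.e. j precedes i in z."""
    z = rng.permutation(seq_len)            # one random factorization order
    rank = np.empty(seq_len, dtype=int)     # rank[pos] = index of pos within z
    rank[z] = np.arange(seq_len)
    # j is visible to i exactly when j comes earlier in the sampled order
    return rank[None, :] < rank[:, None]

# Example: one sampled order for a 5-token sequence such as [New, York, is, a, city]
print(permutation_visibility_mask(5, np.random.default_rng(0)).astype(int))
```

    In the paper the input sequence itself is never reordered; the permutation is realized through attention masks of this kind, together with two-stream self-attention to make predictions aware of the target position, so the model keeps the natural positional encodings while averaging over factorization orders.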
    Methods
    • The authors first review and compare conventional AR language modeling and BERT as approaches to language pretraining.
    • Given a text sequence $\mathbf{x} = [x_1, \cdots, x_T]$, AR language modeling performs pretraining by maximizing the likelihood under the forward autoregressive factorization:

      $$\max_{\theta}\ \log p_{\theta}(\mathbf{x}) = \sum_{t=1}^{T} \log p_{\theta}(x_t \mid \mathbf{x}_{<t}) = \sum_{t=1}^{T} \log \frac{\exp\big(h_{\theta}(\mathbf{x}_{1:t-1})^{\top} e(x_t)\big)}{\sum_{x'} \exp\big(h_{\theta}(\mathbf{x}_{1:t-1})^{\top} e(x')\big)} \quad (1)$$

      where $h_{\theta}(\mathbf{x}_{1:t-1})$ is a context representation produced by a neural model such as an RNN or Transformer, and $e(x)$ denotes the embedding of $x$ (a toy numerical sketch follows below).
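    As a complement to Eq. (1), here is a toy NumPy sketch of the forward-factorized log-likelihood; the mean-of-embeddings context encoder stands in for the paper's RNN/Transformer $h_{\theta}$ and is purely an assumption for illustration:

```python
import numpy as np

def forward_ar_log_likelihood(token_ids, emb):
    """Eq. (1): sum_t log [ exp(h(x_{1:t-1})^T e(x_t)) / sum_{x'} exp(h(x_{1:t-1})^T e(x')) ].
    `emb` is a (vocab_size, dim) embedding matrix; h is a toy mean of the
    context embeddings (a zero vector for the empty context)."""
    total = 0.0
    for t, token in enumerate(token_ids):
        h = emb[token_ids[:t]].mean(axis=0) if t > 0 else np.zeros(emb.shape[1])
        logits = emb @ h                                  # h^T e(x') for every x' in the vocab
        log_z = logits.max() + np.log(np.exp(logits - logits.max()).sum())
        total += logits[token] - log_z                    # log-softmax at the observed token
    return total

# Toy usage: a 10-word vocabulary with 4-dimensional random embeddings
rng = np.random.default_rng(0)
emb = rng.normal(size=(10, 4))
print(forward_ar_log_likelihood([3, 1, 4, 1, 5], emb))
```

    The permutation language modeling objective replaces the fixed left-to-right order in this sum with an expectation over sampled factorization orders, while keeping the same autoregressive form.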
    Results
    • On RACE reading comprehension (accuracy on the Middle / High subsets): GPT [28] 62.9 / 57.4, BERT [25] 76.6 / 70.1, BERT+DCMN* [38] 79.5 / 71.8, RoBERTa [21] 86.5 / 81.8; XLNet outperforms all of these baselines (see Table 2).
    • On ClueWeb09-B document ranking (NDCG@20 and ERR@20), XLNet is compared against DRMM [13], KNRM [8], Conv [8], and BERT†; the full numbers are given in Table 2.
    Conclusion
    • Comparing Eq (2) and (5), the authors observe that both BERT and XLNet perform partial prediction, i.e., only predicting a subset of tokens in the sequence
    • This is a necessary choice for BERT because if all tokens are masked, it is impossible to make any meaningful predictions.
    • To better understand the difference, let’s consider a concrete example [New, York, is, a, city]
    • Suppose both BERT and XLNet select the two tokens [New, York] as the prediction targets and maximize log p(New York | is a city) (see the worked comparison after this list).
    • XLNet achieves substantial improvement over previous pretraining objectives on various tasks
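    For the [New, York, is, a, city] example above, the paper's comparison of the two partial-prediction objectives can be written out as follows (XLNet's version assumes a factorization order that ends with New, York):

```latex
\mathcal{J}_{\text{BERT}}  = \log p(\text{New} \mid \text{is a city})
                           + \log p(\text{York} \mid \text{is a city}),
\qquad
\mathcal{J}_{\text{XLNet}} = \log p(\text{New} \mid \text{is a city})
                           + \log p(\text{York} \mid \text{New}, \text{is a city}).
```

    XLNet therefore captures the dependency pair (New, York), which BERT's independence assumption over masked targets omits; given the same prediction targets, XLNet learns more dependency pairs and receives denser training signals.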
    Tables
    • Table1: Fair comparison with BERT. All models are trained using the same data and hyperparameters as in BERT. We use the best of 3 BERT variants for comparison; i.e., the original BERT, BERT with whole word masking, and BERT without next sentence prediction
    • Table2: Comparison with state-of-the-art results on the test set of RACE, a reading comprehension task, and on ClueWeb09-B, a document ranking task. * indicates using ensembles. † indicates our implementations. “Middle” and “High” in RACE are two subsets representing middle and high school difficulty levels. All BERT, RoBERTa, and XLNet results are obtained with a 24-layer architecture with similar model sizes (aka BERT-Large)
    • Table3: Results on SQuAD, a reading comprehension dataset. † marks our runs with the official code. We are not able to obtain the test results on SQuAD at the time of submission due to the complicated submission process. We will make the results public when they are available
    • Table4: Comparison with state-of-the-art error rates on the test sets of several text classification datasets. All BERT and XLNet results are obtained with a 24-layer architecture with similar model sizes (aka BERT-Large)
    • Table5: Results on GLUE. * indicates using ensembles, and † denotes single-task results in a multi-task row. All dev results are the median of 10 runs. The upper section shows direct comparison on dev data and the lower section shows comparison with state-of-the-art results on the public leaderboard
    • Table6: The results of BERT on RACE are taken from [38]. We run BERT on the other datasets using the official implementation and the same hyperparameter search space as XLNet. K is a hyperparameter to control the optimization difficulty (see Section 2.3)
    Related work
    • The idea of permutation-based AR modeling has been explored in [32, 12], but there are several key differences. First, previous models aim to improve density estimation by baking an “orderless” inductive bias into the model, while XLNet is motivated by enabling AR language models to learn bidirectional contexts. Second, XLNet emphasizes the necessity of being order-aware through (relative) positional encodings, because an orderless model degenerates to a bag-of-words model and lacks basic expressiveness. Moreover, none of the previous permutation-based models identifies or deals with the target-aware distribution problem.

      Another related idea is to perform autoregressive denoising in the context of text generation [11], though it considers only a fixed order.
    Funding
    • ZY and RS were supported by the Office of Naval Research grant N000141812861, the National Science Foundation (NSF) grant IIS1763562, the Nvidia fellowship, and the Siebel scholarship
    • ZD and YY were supported in part by NSF under the grant IIS-1546329 and by the DOE-Office of Science under the grant ASCR #KJ040201
    References
    • Rami Al-Rfou, Dokook Choe, Noah Constant, Mandy Guo, and Llion Jones. Character-level language modeling with deeper self-attention. arXiv preprint arXiv:1808.04444, 2018.
    • Anonymous. BAM! Born-again multi-task networks for natural language understanding. Anonymous preprint under review, 2018.
    • Alexei Baevski and Michael Auli. Adaptive input representations for neural language modeling. arXiv preprint arXiv:1809.10853, 2018.
    • Yoshua Bengio and Samy Bengio. Modeling high-dimensional discrete data with multi-layer neural networks. In Advances in Neural Information Processing Systems, pages 400–406, 2000.
    • Jamie Callan, Mark Hoy, Changkuk Yoo, and Le Zhao. ClueWeb09 data set, 2009.
    • Common Crawl. Common Crawl. URL: http://commoncrawl.org, 2019.
    • Andrew M Dai and Quoc V Le. Semi-supervised sequence learning. In Advances in Neural Information Processing Systems, pages 3079–3087, 2015.
    • Zhuyun Dai, Chenyan Xiong, Jamie Callan, and Zhiyuan Liu. Convolutional neural networks for soft-matching n-grams in ad-hoc search. In Proceedings of the Eleventh ACM International Conference on Web Search and Data Mining, pages 126–134. ACM, 2018.
    • Zihang Dai, Zhilin Yang, Yiming Yang, William W Cohen, Jaime Carbonell, Quoc V Le, and Ruslan Salakhutdinov. Transformer-XL: Attentive language models beyond a fixed-length context. arXiv preprint arXiv:1901.02860, 2019.
    • Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805, 2018.
    • William Fedus, Ian Goodfellow, and Andrew M Dai. MaskGAN: Better text generation via filling in the ______. arXiv preprint arXiv:1801.07736, 2018.
    • Mathieu Germain, Karol Gregor, Iain Murray, and Hugo Larochelle. MADE: Masked autoencoder for distribution estimation. In International Conference on Machine Learning, pages 881–889, 2015.
    • Jiafeng Guo, Yixing Fan, Qingyao Ai, and W Bruce Croft. A deep relevance matching model for ad-hoc retrieval. In Proceedings of the 25th ACM International Conference on Information and Knowledge Management, pages 55–64. ACM, 2016.
    • Jeremy Howard and Sebastian Ruder. Universal language model fine-tuning for text classification. arXiv preprint arXiv:1801.06146, 2018.
    • Rie Johnson and Tong Zhang. Deep pyramid convolutional neural networks for text categorization. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 562–570, 2017.
    • Vid Kocijan, Ana-Maria Cretu, Oana-Maria Camburu, Yordan Yordanov, and Thomas Lukasiewicz. A surprisingly robust trick for the Winograd schema challenge. arXiv preprint arXiv:1905.06290, 2019.
    • Taku Kudo and John Richardson. SentencePiece: A simple and language independent subword tokenizer and detokenizer for neural text processing. arXiv preprint arXiv:1808.06226, 2018.
    • Guokun Lai, Qizhe Xie, Hanxiao Liu, Yiming Yang, and Eduard Hovy. RACE: Large-scale reading comprehension dataset from examinations. arXiv preprint arXiv:1704.04683, 2017.
    • Zhenzhong Lan, Mingda Chen, Sebastian Goodman, Kevin Gimpel, Piyush Sharma, and Radu Soricut. ALBERT: A lite BERT for self-supervised learning of language representations. arXiv preprint arXiv:1909.11942, 2019.
    • Xiaodong Liu, Pengcheng He, Weizhu Chen, and Jianfeng Gao. Multi-task deep neural networks for natural language understanding. arXiv preprint arXiv:1901.11504, 2019.
    • Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, and Veselin Stoyanov. RoBERTa: A robustly optimized BERT pretraining approach. arXiv preprint arXiv:1907.11692, 2019.
    • Bryan McCann, James Bradbury, Caiming Xiong, and Richard Socher. Learned in translation: Contextualized word vectors. In Advances in Neural Information Processing Systems, pages 6294–6305, 2017.
    • Takeru Miyato, Andrew M Dai, and Ian Goodfellow. Adversarial training methods for semi-supervised text classification. arXiv preprint arXiv:1605.07725, 2016.
    • Aaron van den Oord, Nal Kalchbrenner, and Koray Kavukcuoglu. Pixel recurrent neural networks. arXiv preprint arXiv:1601.06759, 2016.
    • Xiaoman Pan, Kai Sun, Dian Yu, Heng Ji, and Dong Yu. Improving question answering with external knowledge. arXiv preprint arXiv:1902.00993, 2019.
    • Robert Parker, David Graff, Junbo Kong, Ke Chen, and Kazuaki Maeda. English Gigaword fifth edition. Technical report, Linguistic Data Consortium, Philadelphia, 2011.
    • Matthew E Peters, Mark Neumann, Mohit Iyyer, Matt Gardner, Christopher Clark, Kenton Lee, and Luke Zettlemoyer. Deep contextualized word representations. arXiv preprint arXiv:1802.05365, 2018.
    • Alec Radford, Karthik Narasimhan, Tim Salimans, and Ilya Sutskever. Improving language understanding by generative pre-training. URL: https://s3-us-west-2.amazonaws.com/openai-assets/research-covers/language-unsupervised/language_understanding_paper.pdf, 2018.
    • Pranav Rajpurkar, Robin Jia, and Percy Liang. Know what you don't know: Unanswerable questions for SQuAD. arXiv preprint arXiv:1806.03822, 2018.
    • Pranav Rajpurkar, Jian Zhang, Konstantin Lopyrev, and Percy Liang. SQuAD: 100,000+ questions for machine comprehension of text. arXiv preprint arXiv:1606.05250, 2016.
    • Devendra Singh Sachan, Manzil Zaheer, and Ruslan Salakhutdinov. Revisiting LSTM networks for semi-supervised text classification via mixed objective function. 2018.
    • Benigno Uria, Marc-Alexandre Côté, Karol Gregor, Iain Murray, and Hugo Larochelle. Neural autoregressive distribution estimation. The Journal of Machine Learning Research, 17(1):7184–7220, 2016.
    • Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. In Advances in Neural Information Processing Systems, pages 5998–6008, 2017.
    • Alex Wang, Amanpreet Singh, Julian Michael, Felix Hill, Omer Levy, and Samuel R. Bowman. GLUE: A multi-task benchmark and analysis platform for natural language understanding. In Proceedings of ICLR, 2019.
    • Qizhe Xie, Zihang Dai, Eduard Hovy, Minh-Thang Luong, and Quoc V. Le. Unsupervised data augmentation. arXiv preprint arXiv:1904.12848, 2019.
    • Chenyan Xiong, Zhuyun Dai, Jamie Callan, Zhiyuan Liu, and Russell Power. End-to-end neural ad-hoc ranking with kernel pooling. In Proceedings of the 40th International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 55–64. ACM, 2017.
    • Zhilin Yang, Zihang Dai, Ruslan Salakhutdinov, and William W Cohen. Breaking the softmax bottleneck: A high-rank RNN language model. arXiv preprint arXiv:1711.03953, 2017.
    • Shuailiang Zhang, Hai Zhao, Yuwei Wu, Zhuosheng Zhang, Xi Zhou, and Xiang Zhou. Dual co-matching network for multi-choice reading comprehension. arXiv preprint arXiv:1901.09381, 2019.
    • Xiang Zhang, Junbo Zhao, and Yann LeCun. Character-level convolutional networks for text classification. In Advances in Neural Information Processing Systems, pages 649–657, 2015.
    • Yukun Zhu, Ryan Kiros, Rich Zemel, Ruslan Salakhutdinov, Raquel Urtasun, Antonio Torralba, and Sanja Fidler. Aligning books and movies: Towards story-like visual explanations by watching movies and reading books. In Proceedings of the IEEE International Conference on Computer Vision, pages 19–27, 2015.