ELECTRA: Pre-training Text Encoders as Discriminators Rather Than Generators

ICLR, 2020.

Keywords:
Natural Language Processing; Representation Learning

Abstract:

Masked language modeling (MLM) pre-training methods such as BERT corrupt the input by replacing some tokens with [MASK] and then train a model to reconstruct the original tokens. While they produce good results when transferred to downstream NLP tasks, they generally require large amounts of compute to be effective. As an alternative, we propose a more sample-efficient pre-training task, replaced token detection, in which a small generator network replaces some input tokens with plausible alternatives and the main model is trained as a discriminator to predict which tokens were replaced.
Introduction
  • Current state-of-the-art representation learning methods for language can be viewed as learning denoising autoencoders (Vincent et al., 2008).
  • They select a small subset of the unlabeled input sequence, mask the identities of those tokens (e.g., BERT; Devlin et al., 2019) or the attention to those tokens (e.g., XLNet; Yang et al., 2019), and train the network to recover the original input.
Highlights
  • Current state-of-the-art representation learning methods for language can be viewed as learning denoising autoencoders (Vincent et al., 2008).
  • We find that ELECTRA benefits greatly from having a loss defined over all input tokens rather than just a subset: ELECTRA 15%, a variant whose discriminator loss is computed only over the 15% of tokens that were masked out, performs much worse than the full model.
  • We find that BERT’s performance is slightly harmed by the pre-train/fine-tune mismatch caused by [MASK] tokens, as Replace MLM, which performs masked language modeling with the masked tokens replaced by generator samples rather than [MASK], slightly outperforms BERT.
  • We note that BERT already includes a trick to help with the pre-train/fine-tune discrepancy: masked tokens are replaced with a random token 10% of the time and are kept the same 10% of the time (a minimal sketch of this rule follows this list).
  • We have proposed replaced token detection, a new self-supervised task for language representation learning
  • The key idea is training a text encoder to distinguish input tokens from high-quality negative samples produced by a small generator network
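As a reading aid for the masking trick noted above, here is a minimal sketch of BERT's standard 80/10/10 corruption rule applied to one selected position; the function name, arguments, and defaults are illustrative assumptions, not code from any particular implementation.

```python
import random

def bert_corrupt_token(token, vocab, mask_token="[MASK]"):
    """Hypothetical helper: BERT's 80/10/10 rule for one selected token."""
    r = random.random()
    if r < 0.8:
        return mask_token            # 80%: replace with [MASK]
    elif r < 0.9:
        return random.choice(vocab)  # 10%: replace with a random vocabulary token
    else:
        return token                 # 10%: keep the original token unchanged
```

Even with this trick, the model still sees [MASK] at most selected positions during pre-training but never during fine-tuning, which is the discrepancy the replaced token detection task avoids.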
Methods
  • We first describe the replaced token detection pre-training task; see Figure 2 for an overview.
  • (Figure 2 shows a worked example: masked positions in an input such as “the chef cooked the meal” are filled in by the generator’s samples, and the discriminator labels every resulting token as original or replaced.)
  • Our approach trains two neural networks, a generator G and a discriminator D.
  • For a given position t, the discriminator predicts whether the token x_t is “real,” i.e., that it comes from the data rather than the generator distribution, with a sigmoid output layer: D(x, t) = sigmoid(w^T h_D(x)_t), where h_D(x)_t is the discriminator’s contextualized representation of position t.
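To make the training setup concrete, the following is a minimal PyTorch-style sketch of a replaced token detection step, assuming `generator(ids)` returns vocabulary logits of shape (batch, length, vocab) and `discriminator(ids)` returns one real-vs-replaced logit per position; the function, its defaults, and the single-tensor interface are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn.functional as F

def replaced_token_detection_step(generator, discriminator, tokens,
                                  mask_prob=0.15, mask_id=0, disc_weight=50.0):
    """One hypothetical pre-training step: generator MLM + discriminator RTD."""
    # 1. Mask a random ~15% subset of input positions.
    masked_positions = torch.rand(tokens.shape, device=tokens.device) < mask_prob
    masked_tokens = tokens.masked_fill(masked_positions, mask_id)

    # 2. The small generator (a masked language model) predicts the masked-out tokens.
    gen_logits = generator(masked_tokens)                                      # (B, T, V)
    mlm_loss = F.cross_entropy(gen_logits[masked_positions],
                               tokens[masked_positions])

    # 3. Corrupt the input by sampling plausible replacements from the generator.
    with torch.no_grad():
        samples = torch.distributions.Categorical(logits=gen_logits).sample()  # (B, T)
    corrupted = torch.where(masked_positions, samples, tokens)

    # 4. The discriminator predicts, for every position, whether the token is the
    #    original ("real") or a generator replacement; sampled tokens that happen
    #    to match the original count as real.
    labels = (corrupted == tokens).float()
    disc_loss = F.binary_cross_entropy_with_logits(discriminator(corrupted), labels)

    # Combined loss: generator MLM loss plus a weighted discriminator loss
    # (the appendix below mentions a discriminator weight of 50).
    return mlm_loss + disc_weight * disc_loss
```

Note that the discriminator loss is summed over every input position rather than only the masked subset, which is the property the ablations below credit with most of ELECTRA's gains.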
Results
  • ELECTRA-Small performs remarkably well given its size, achieving a higher GLUE score than other methods that use substantially more compute and parameters.
  • We find that All-Tokens MLM, the generative model that makes predictions over all tokens instead of a subset, closes most of the gap between BERT and ELECTRA.
  • These results suggest that a large amount of ELECTRA’s improvement can be attributed to learning from all tokens, and a smaller amount to alleviating the pre-train/fine-tune mismatch.
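For reference, here is a hedged reconstruction of the two objectives being contrasted, in the notation of the discriminator definition above (m is the set of masked-out positions and x^corrupt the input after generator replacement); the MLM loss touches only the masked subset, while the discriminator loss covers all n input positions.

```latex
\mathcal{L}_{\mathrm{MLM}}(\boldsymbol{x}, \theta_G) =
  \mathbb{E}\left[\sum_{i \in \boldsymbol{m}} -\log p_G\!\left(x_i \mid \boldsymbol{x}^{\mathrm{masked}}\right)\right]

\mathcal{L}_{\mathrm{Disc}}(\boldsymbol{x}, \theta_D) =
  \mathbb{E}\left[\sum_{t=1}^{n}
    -\mathbb{1}\!\left(x_t^{\mathrm{corrupt}} = x_t\right)\log D(\boldsymbol{x}^{\mathrm{corrupt}}, t)
    -\mathbb{1}\!\left(x_t^{\mathrm{corrupt}} \neq x_t\right)\log\!\left(1 - D(\boldsymbol{x}^{\mathrm{corrupt}}, t)\right)\right]
```

All-Tokens MLM can be read as keeping the generative (softmax-over-vocabulary) form of the first loss while extending its sum to every position, which is why it closes most, but not all, of the gap.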
Conclusion
  • We have proposed replaced token detection, a new self-supervised task for language representation learning.
  • Compared to masked language modeling, our pre-training objective is more compute-efficient and results in better performance on downstream tasks.
  • It works well even when using relatively small amounts of compute, which we hope will make developing and applying pre-trained text encoders more accessible to researchers and practitioners with less access to computing resources.
  • We hope more future work on NLP pre-training will consider efficiency as well as absolute performance, and follow our effort in reporting compute usage and parameter counts along with evaluation metrics
Tables
  • Table1: Comparison of small models on the GLUE dev set. BERT-Small/Base are our implementation and use the same hyperparameters as ELECTRA-Small/Base. Infer FLOPs assumes single length-128 input. Training times should be taken with a grain of salt as they are for different hardware and with sometimes un-optimized code. ELECTRA performs well even when trained on a single GPU, scoring 5 GLUE points higher than a comparable BERT model and even outscoring the much larger GPT model
  • Table2: Comparison of large models on the GLUE dev set. ELECTRA and RoBERTa are shown for different numbers of pre-training steps, indicated by the numbers after the dashes. ELECTRA performs comparably to XLNet and RoBERTa when using less than 1/4 of their pre-training compute and outperforms them when given a similar amount of pre-training compute. BERT dev results are from Clark et al. (2019)
  • Table3: GLUE test-set results for large models. Models in this table incorporate additional tricks such as ensembling to improve scores (see Appendix B for details). Some models do not have QNLI scores because they treat QNLI as a ranking task, which has recently been disallowed by the GLUE benchmark. To compare against these models, we report the average score excluding QNLI (Avg.*) in addition to the GLUE leaderboard score (Score). “ELECTRA” and “RoBERTa” refer to the fully-trained ELECTRA-1.75M and RoBERTa-500K models
  • Table4: Results on SQuAD for non-ensemble models
  • Table5: Compute-efficiency experiments (see text for details)
  • Table6: Pre-train hyperparameters. We also train an ELECTRA-Large model for 1.75M steps (other hyperparameters are identical)
  • Table7: Fine-tune hyperparameters
  • Table8: Results for models on the GLUE test set. Only models with single-task finetuning (no ensembling, task-specific tricks, etc.) are shown
Related work
  • Self-Supervised Pre-training for NLP: Self-supervised learning has been used to learn word representations (Collobert et al., 2011; Pennington et al., 2014) and, more recently, contextual representations of words through objectives such as language modeling (Dai & Le, 2015; Peters et al., 2018; Howard & Ruder, 2018). BERT (Devlin et al., 2019) pre-trains a large Transformer (Vaswani et al., 2017) on the masked language modeling task. There have been numerous extensions to BERT. For example, MASS (Song et al., 2019) and UniLM (Dong et al., 2019) extend BERT to generation tasks by adding auto-regressive generative training objectives. ERNIE (Sun et al., 2019a) and SpanBERT (Joshi et al., 2019) mask out contiguous sequences of tokens for improved span representations. This idea may be complementary to ELECTRA; we think it would be interesting to make ELECTRA’s generator auto-regressive and add a “replaced span detection” task. Instead of masking out input tokens, XLNet (Yang et al., 2019) masks attention weights such that the input sequence is autoregressively generated in a random order. However, this method suffers from the same inefficiencies as BERT because XLNet only generates 15% of the input tokens in this way. Like ELECTRA, XLNet may alleviate BERT’s pre-train/fine-tune discrepancy by not requiring [MASK] tokens, although this isn’t entirely clear because XLNet uses two “streams” of attention during pre-training but only one for fine-tuning. Recently, models such as TinyBERT (Jiao et al., 2019) and MobileBERT (Sun et al., 2019b) have shown that BERT can effectively be distilled down to a smaller model. In contrast, we focus more on pre-training speed than inference speed, so we train ELECTRA-Small from scratch.
Funding
  • Kevin is supported by a Google PhD Fellowship
Reference
  • Antoine Bordes, Nicolas Usunier, Alberto Garcıa-Duran, Jason Weston, and Oksana Yakhnenko. Translating embeddings for modeling multi-relational data. In NeurIPS, 2013.
  • Avishek Joey Bose, Huan Ling, and Yanshuai Cao. Adversarial contrastive estimation. In ACL, 2018.
  • Massimo Caccia, Lucas Caccia, William Fedus, Hugo Larochelle, Joelle Pineau, and Laurent Charlin. Language GANs falling short. arXiv preprint arXiv:1811.02549, 2018.
  • Jamie Callan, Mark Hoy, Changkuk Yoo, and Le Zhao. Clueweb09 data set, 2009. URL https://lemurproject.org/clueweb09.php/.
  • Daniel M. Cer, Mona T. Diab, Eneko Agirre, Inigo Lopez-Gazpio, and Lucia Specia. Semeval2017 task 1: Semantic textual similarity multilingual and crosslingual focused evaluation. In SemEval@ACL, 2017.
  • Sumit Chopra, Raia Hadsell, and Yann LeCun. Learning a similarity metric discriminatively, with application to face verification. CVPR, 2005.
  • Kevin Clark, Minh-Thang Luong, Urvashi Khandelwal, Christopher D. Manning, and Quoc V. Le. BAM! Born-again multi-task networks for natural language understanding. In ACL, 2019.
  • Ronan Collobert, Jason Weston, Leon Bottou, Michael Karlen, Koray Kavukcuoglu, and Pavel P. Kuksa. Natural language processing (almost) from scratch. JMLR, 2011.
  • Andrew M Dai and Quoc V Le. Semi-supervised sequence learning. In NeurIPS, 2015.
  • Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. BERT: Pre-training of deep bidirectional transformers for language understanding. In NAACL-HLT, 2019.
  • William B. Dolan and Chris Brockett. Automatically constructing a corpus of sentential paraphrases. In IWP@IJCNLP, 2005.
  • Li Dong, Nan Yang, Wenhui Wang, Furu Wei, Xiaodong Liu, Yu Wang, Jianfeng Gao, Ming Zhou, and Hsiao-Wuen Hon. Unified language model pre-training for natural language understanding and generation. In NeurIPS, 2019.
  • William Fedus, Ian Goodfellow, and Andrew M. Dai. MaskGAN: Better text generation via filling in the ______. In ICLR, 2018.
  • Danilo Giampiccolo, Bernardo Magnini, Ido Dagan, and William B. Dolan. The third pascal recognizing textual entailment challenge. In ACL-PASCAL@ACL, 2007.
  • Ian J. Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron C. Courville, and Yoshua Bengio. Generative adversarial nets. In NeurIPS, 2014.
  • Michael Gutmann and Aapo Hyvarinen. Noise-contrastive estimation: A new estimation principle for unnormalized statistical models. In AISTATS, 2010.
  • Jeremy Howard and Sebastian Ruder. Universal language model fine-tuning for text classification. In ACL, 2018.
  • First Quora dataset release: Question pairs, 2017. URL https://data.quora.com/.
  • Xiaoqi Jiao, Yichun Yin, Lifeng Shang, Xin Jiang, Xiao Chen, Linlin Li, Fang Wang, and Qun Liu. Tinybert: Distilling bert for natural language understanding. arXiv preprint arXiv:1909.10351, 2019.
  • Mandar Joshi, Danqi Chen, Yinhan Liu, Daniel S Weld, Luke Zettlemoyer, and Omer Levy. SpanBERT: Improving pre-training by representing and predicting spans. arXiv preprint arXiv:1907.10529, 2019.
  • Zhenzhong Lan, Mingda Chen, Sebastian Goodman, Kevin Gimpel, Piyush Sharma, and Radu Soricut. ALBERT: A lite bert for self-supervised learning of language representations. arXiv preprint arXiv:1909.11942, 2019.
  • Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, and Veselin Stoyanov. RoBERTa: A robustly optimized BERT pretraining approach. arXiv preprint arXiv:1907.11692, 2019.
  • Tomas Mikolov, Kai Chen, Gregory S. Corrado, and Jeffrey Dean. Efficient estimation of word representations in vector space. In ICLR Workshop Papers, 2013.
  • Robert Parker, David Graff, Junbo Kong, Ke Chen, and Kazuaki Maeda. English gigaword, fifth edition. Technical report, Linguistic Data Consortium, Philadelphia, 2011.
  • Jeffrey Pennington, Richard Socher, and Christopher Manning. Glove: Global vectors for word representation. In EMNLP, 2014.
  • Matthew E Peters, Mark Neumann, Mohit Iyyer, Matt Gardner, Christopher Clark, Kenton Lee, and Luke Zettlemoyer. Deep contextualized word representations. In NAACL-HLT, 2018.
  • Jason Phang, Thibault Fevry, and Samuel R Bowman. Sentence encoders on STILTs: Supplementary training on intermediate labeled-data tasks. arXiv preprint arXiv:1811.01088, 2018.
  • Alec Radford, Luke Metz, and Soumith Chintala. Unsupervised representation learning with deep convolutional generative adversarial networks. In ICLR, 2016.
  • Alec Radford, Karthik Narasimhan, Tim Salimans, and Ilya Sutskever. Improving language understanding by generative pre-training. https://blog.openai.com/language-unsupervised, 2018.
  • Pranav Rajpurkar, Jian Zhang, Konstantin Lopyrev, and Percy S. Liang. SQuAD: 100,000+ questions for machine comprehension of text. In EMNLP, 2016.
  • Nikunj Saunshi, Orestis Plevrakis, Sanjeev Arora, Mikhail Khodak, and Hrishikesh Khandeparkar. A theoretical analysis of contrastive unsupervised representation learning. In ICML, 2019.
  • Pierre Sermanet, Corey Lynch, Yevgen Chebotar, Jasmine Hsu, Eric Jang, Stefan Schaal, and Sergey Levine. Time-contrastive networks: Self-supervised learning from video. ICRA, 2017.
  • Noah A. Smith and Jason Eisner. Contrastive estimation: Training log-linear models on unlabeled data. In ACL, 2005.
  • Richard Socher, Alex Perelygin, Jean Wu, Jason Chuang, Christopher D. Manning, Andrew Y. Ng, and Christopher Potts. Recursive deep models for semantic compositionality over a sentiment treebank. In EMNLP, 2013.
  • Kaitao Song, Xu Tan, Tao Qin, Jianfeng Lu, and Tie-Yan Liu. MASS: Masked sequence to sequence pre-training for language generation. In ICML, 2019.
  • Yu Sun, Shuohuan Wang, Yukun Li, Shikun Feng, Xuyi Chen, Han Zhang, Xin Tian, Danxiang Zhu, Hao Tian, and Hua Wu. Ernie: Enhanced representation through knowledge integration. arXiv preprint arXiv:1904.09223, 2019a.
  • Zhiqing Sun, Hongkun Yu, Xiaodan Song, Renjie Liu, Yiming Yang, and Denny Zhou. MobileBERT: Task-agnostic compression of bert for resource limited devices, 2019b. URL https://openreview.net/forum?id=SJxjVaNKwB.
  • Guy Tevet, Gavriel Habib, Vered Shwartz, and Jonathan Berant. Evaluating text gans as language models. In NAACL-HLT, 2018.
  • Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, and Illia Polosukhin. Attention is all you need. In NeurIPS, 2017.
  • Pascal Vincent, Hugo Larochelle, Yoshua Bengio, and Pierre-Antoine Manzagol. Extracting and composing robust features with denoising autoencoders. In ICML, 2008.
  • Alex Wang, Amanpreet Singh, Julian Michael, Felix Hill, Omer Levy, and Samuel R. Bowman. GLUE: A multi-task benchmark and analysis platform for natural language understanding. In ICLR, 2019.
  • Xiaolong Wang and Abhinav Gupta. Unsupervised learning of visual representations using videos. ICCV, 2015.
  • Alex Warstadt, Amanpreet Singh, and Samuel R. Bowman. Neural network acceptability judgments. arXiv preprint arXiv:1805.12471, 2018.
  • Adina Williams, Nikita Nangia, and Samuel R. Bowman. A broad-coverage challenge corpus for sentence understanding through inference. In NAACL-HLT, 2018.
  • Ronald J. Williams. Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine Learning, 8(3-4):229–256, 1992.
  • Zhilin Yang, Zihang Dai, Yiming Yang, Jaime Carbonell, Ruslan Salakhutdinov, and Quoc V Le. XLNet: Generalized autoregressive pretraining for language understanding. In NeurIPS, 2019.
  • Lantao Yu, Weinan Zhang, Jun Wang, and Yong Yu. SeqGAN: Sequence generative adversarial nets with policy gradient. In AAAI, 2017.
  • Yizhe Zhang, Zhe Gan, Kai Fan, Zhi Chen, Ricardo Henao, Dinghan Shen, and Lawrence Carin. Adversarial feature matching for text generation. In ICML, 2017.
  • Yukun Zhu, Ryan Kiros, Richard S. Zemel, Ruslan Salakhutdinov, Raquel Urtasun, Antonio Torralba, and Sanja Fidler. Aligning books and movies: Towards story-like visual explanations by watching movies and reading books. ICCV, 2015.
  • The following details apply to both our ELECTRA models and BERT baselines. We mostly use the same hyperparameters as BERT. We set λ, the weight for the discriminator objective in the loss, to 50. We use dynamic token masking, with the masked positions decided on-the-fly instead of during preprocessing. Also, we did not use the next-sentence prediction objective proposed in the original BERT paper, as recent work has suggested it does not improve scores (Yang et al., 2019; Liu et al., 2019). For our ELECTRA-Large model, we used a higher mask percentage (25% instead of 15%) because we noticed the generator was achieving high accuracy with 15% masking, resulting in very few replaced tokens. We searched for the best learning rate for the Base and Small models out of [1e-4, 2e-4, 3e-4, 5e-4] and selected λ out of [1, 10, 20, 50, 100] in early experiments. Otherwise we did no hyperparameter tuning beyond the experiments in Section 3.2. The full set of hyperparameters is listed in Table 6.
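Combining this weighting with the two losses sketched earlier, the overall pre-training objective implied by the description above can be written schematically as follows (a reconstruction from this summary, not a quotation of the paper), where X denotes the pre-training corpus:

```latex
\min_{\theta_G,\, \theta_D} \; \sum_{\boldsymbol{x} \in \mathcal{X}}
  \mathcal{L}_{\mathrm{MLM}}(\boldsymbol{x}, \theta_G)
  + \lambda\, \mathcal{L}_{\mathrm{Disc}}(\boldsymbol{x}, \theta_D),
  \qquad \lambda = 50 .
```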
  • Following BERT, we do not show results on the WNLI GLUE task for the dev-set results, as it is difficult to beat even the majority classifier using a standard fine-tuning-as-classifier approach. For the GLUE test-set results, we apply the standard tricks used by many of the GLUE leaderboard submissions, including RoBERTa (Liu et al., 2019), XLNet (Yang et al., 2019), and ALBERT (Lan et al., 2019).