BART: Denoising Sequence-to-Sequence Pre-training for Natural Language Generation, Translation, and Comprehension

ACL, pp. 7871-7880, 2020.

Keywords:
language model, neural machine translation, machine translation, noising function, arbitrary noising

Abstract:

We present BART, a denoising autoencoder for pretraining sequence-to-sequence models. BART is trained by (1) corrupting text with an arbitrary noising function, and (2) learning a model to reconstruct the original text. It uses a standard Transformer-based neural machine translation architecture which, despite its simplicity, can be seen as generalizing BERT (due to the bidirectional encoder), GPT (with the left-to-right decoder), and many other more recent pretraining schemes.
Introduction
  • Self-supervised methods have achieved remarkable success in a wide range of NLP tasks (Mikolov et al., 2013; Peters et al., 2018; Devlin et al., 2019; Joshi et al., 2019; Yang et al., 2019; Liu et al., 2019).
  • Recent work has shown gains by improving the distribution of masked tokens (Joshi et al., 2019), the order in which masked tokens are predicted (Yang et al., 2019), and the available context for replacing masked tokens (Dong et al., 2019).
  • These methods typically focus on particular types of end tasks, limiting their applicability.
  • BART uses a standard Transformer-based neural machine translation architecture which, despite its simplicity, can be seen as generalizing BERT, GPT, and many other more recent pretraining schemes.
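To make the relationship to BERT and GPT concrete, the following is a minimal sketch of one denoising training step in plain PyTorch, not the authors' implementation: the bidirectional encoder reads the corrupted text (as in BERT), the autoregressive decoder reconstructs the original left to right (as in GPT), and the loss is token-level cross-entropy against the uncorrupted document. Module names and sizes are illustrative; one possible noising function is sketched in the Methods section below.

```python
# Minimal sketch (illustrative, not the authors' code) of a BART-style
# denoising training step: bidirectional encoder over the corrupted text,
# autoregressive decoder reconstructing the original text.
import torch
import torch.nn as nn

VOCAB, D_MODEL, PAD = 50265, 1024, 1  # sizes are illustrative

embed = nn.Embedding(VOCAB, D_MODEL, padding_idx=PAD)
seq2seq = nn.Transformer(d_model=D_MODEL, nhead=16,
                         num_encoder_layers=12, num_decoder_layers=12,
                         dim_feedforward=4096, batch_first=True)
lm_head = nn.Linear(D_MODEL, VOCAB)
criterion = nn.CrossEntropyLoss(ignore_index=PAD)

def denoising_loss(original_ids, corrupted_ids):
    """original_ids / corrupted_ids: LongTensors of shape (batch, seq)."""
    # Encoder attends bidirectionally over the corrupted input (BERT-like).
    src = embed(corrupted_ids)
    # Decoder predicts the original document left to right (GPT-like),
    # with teacher forcing and a causal mask.
    tgt_in, tgt_out = original_ids[:, :-1], original_ids[:, 1:]
    causal = seq2seq.generate_square_subsequent_mask(tgt_in.size(1))
    hidden = seq2seq(src, embed(tgt_in), tgt_mask=causal)
    logits = lm_head(hidden)
    # Reconstruction loss against the uncorrupted text.
    return criterion(logits.reshape(-1, VOCAB), tgt_out.reshape(-1))
```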
Highlights
  • Self-supervised methods have achieved remarkable success in a wide range of NLP tasks (Mikolov et al., 2013; Peters et al., 2018; Devlin et al., 2019; Joshi et al., 2019; Yang et al., 2019; Liu et al., 2019).
  • Recent work has shown gains by improving the distribution of masked tokens (Joshi et al., 2019), the order in which masked tokens are predicted (Yang et al., 2019), and the available context for replacing masked tokens (Dong et al., 2019). These methods typically focus on particular types of end tasks, limiting their applicability.
  • BART is a denoising autoencoder built with a sequence-to-sequence model that is applicable to a very wide range of end tasks
  • We present a new scheme for machine translation where a BART model is stacked above a few additional transformer layers
  • We explore using BART to improve machine translation decoders for translating into English
  • We show that it is possible to use the entire BART model as a single pretrained decoder for machine translation, by adding a new set of encoder parameters that are learned from bitext
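As a rough picture of this translation setup, the sketch below adds a small, randomly initialised source encoder (with its own vocabulary) whose output is fed into the pretrained BART model in place of BART's usual input embeddings, so that the entire pretrained model acts as the decoder of the MT system. Here `bart` stands in for any pretrained BART implementation that can consume encoder input embeddings directly (the `inputs_embeds` argument in common implementations); class and parameter names are illustrative, not the released code.

```python
# Illustrative sketch: a small new transformer encoder, learned from
# bitext, replaces BART's input embedding layer; the whole pretrained
# BART model then maps its output into English.
import torch.nn as nn

class BartForTranslation(nn.Module):
    def __init__(self, bart, src_vocab_size, d_model=1024,
                 num_new_layers=6, nhead=16):
        super().__init__()
        # Newly initialised parameters, trained on bitext.
        self.src_embed = nn.Embedding(src_vocab_size, d_model)
        layer = nn.TransformerEncoderLayer(d_model, nhead,
                                           dim_feedforward=4096,
                                           batch_first=True)
        self.new_encoder = nn.TransformerEncoder(layer, num_new_layers)
        # The entire pretrained BART model (encoder + decoder).
        self.bart = bart

    def forward(self, src_ids, decoder_input_ids):
        # Map source-language tokens into representations that BART
        # consumes in place of its own input embeddings.
        src_repr = self.new_encoder(self.src_embed(src_ids))
        return self.bart(inputs_embeds=src_repr,
                         decoder_input_ids=decoder_input_ids)
```

In the paper this setup is trained in two stages: first only the new source encoder and a few BART input parameters are updated while the rest is frozen, then all parameters are fine-tuned for a small number of iterations.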
Methods
  • The authors pre-train a large model with 12 layers in each of the encoder and decoder, and a hidden size of 1024.
  • Following RoBERTa (Liu et al., 2019), the authors use a batch size of 8000, and train the model for 500,000 steps.
  • Documents are tokenized with the same byte-pair encoding as GPT-2 (Radford et al., 2019).
  • Based on the results in Section 4, the authors use a combination of text infilling and sentence permutation.
  • The authors mask 30% of tokens in each document, and permute all sentences.
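A simplified sketch of this noising procedure follows, assuming the paper's text infilling setup in which span lengths are drawn from a Poisson distribution (λ = 3) and each sampled span is replaced by a single mask token. Sentence splitting and tokenisation details are omitted, and the helper names are illustrative rather than the authors' implementation.

```python
# Sketch of the pre-training noise: permute the sentences of a document,
# then apply text infilling until roughly 30% of tokens are masked,
# sampling span lengths from Poisson(lambda=3) and replacing each span
# with a single <mask> token (a length-0 span just inserts a <mask>).
import random
import numpy as np

MASK = "<mask>"

def permute_sentences(sentences):
    """Return the document's sentences in a random order."""
    return random.sample(sentences, len(sentences))

def text_infilling(tokens, mask_ratio=0.3, poisson_lambda=3.0):
    """Replace token spans with a single <mask> until ~mask_ratio is masked."""
    tokens = list(tokens)
    to_mask = int(round(mask_ratio * len(tokens)))
    masked = 0
    while masked < to_mask:
        span = int(np.random.poisson(poisson_lambda))
        start = random.randrange(len(tokens) + 1)
        span = min(span, to_mask - masked, len(tokens) - start)
        tokens[start:start + span] = [MASK]
        masked += span
    return tokens

def add_noise(document):
    """document: list of sentences, each a list of tokens."""
    flat = [tok for sent in permute_sentences(document) for tok in sent]
    return text_infilling(flat)
```

For example, add_noise([["The", "cat", "sat", "."], ["It", "purred", "."]]) returns a shuffled, partially masked token sequence that the model must map back to the original document.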
Results
  • Results are shown in Table 1.
  • Several trends are clear from the comparison in Table 1, which evaluates BERT Base (Devlin et al., 2019) alongside BART Base trained with token masking, token deletion, text infilling, document rotation, sentence shuffling, and text infilling + sentence shuffling, reporting SQuAD 1.1 F1, MNLI accuracy, and perplexity on ELI5, XSum, ConvAI2, and CNN/DM.
  • A simple language model achieves the best ELI5 performance, but the worst SQuAD results.
Conclusion
  • The authors introduced BART, a pre-training approach that learns to map corrupted documents to the original.
  • BART achieves similar performance to RoBERTa on discriminative tasks, while achieving new state-of-the-art results on a number of text generation tasks.
  • Future work should explore new methods for corrupting documents for pre-training, perhaps tailoring them to specific end tasks
Tables
  • Table1: Comparison of pre-training objectives. All models are of comparable size and are trained for 1M steps on a combination of books and Wikipedia data. Entries in the bottom two blocks are trained on identical data using the same code-base, and fine-tuned with the same procedures. Entries in the second block are inspired by pre-training objectives proposed in previous work, but have been simplified to focus on evaluation objectives (see §4.1). Performance varies considerably across tasks, but the BART models with text infilling demonstrate the most consistently strong performance
  • Table2: Results for large models on SQuAD and GLUE tasks. BART performs comparably to RoBERTa and XLNet, suggesting that BART’s uni-directional decoder layers do not reduce performance on discriminative tasks
  • Table3: Results on two standard summarization datasets. BART outperforms previous work on summarization on two tasks and all metrics, with gains of roughly 6 points on the more abstractive dataset
  • Table4: BART outperforms previous work on conversational response generation. Perplexities are renormalized based on the official tokenizer for ConvAI2
  • Table5: BART achieves state-of-the-art results on the challenging ELI5 abstractive question answering dataset. Comparison models are from Fan et al. (2019)
  • Table6: The performance (BLEU) of baseline and BART on WMT’16 RO-EN augmented with backtranslation data. BART improves over a strong backtranslation (BT) baseline by using monolingual English pre-training
  • Table7: Example summaries from the XSum-tuned BART model on WikiNews articles. For clarity, only relevant excerpts of the source are shown. Summaries combine information from across the article and prior knowledge
Related work
  • Early methods for pretraining were based on language models. GPT (Radford et al., 2018) only models leftward context, which is problematic for some tasks. ELMo (Peters et al., 2018) concatenates left-only and right-only representations, but does not pre-train interactions between these features. Radford et al. (2019) demonstrated that very large language models can act as unsupervised multitask models.

    BERT (Devlin et al., 2019) introduced masked language modelling, which allows pre-training to learn interactions between left and right context words. Recent work has shown that very strong performance can be achieved by training for longer (Liu et al., 2019), by tying parameters across layers (Lan et al., 2019), and by masking spans instead of words (Joshi et al., 2019). However, predictions are not made auto-regressively, reducing the effectiveness of BERT for generation tasks.
Contributions
  • Presents BART, a denoising autoencoder for pretraining sequence-to-sequence models
  • Evaluates a number of noising approaches, finding the best performance by both randomly shuffling the order of the original sentences and using a novel in-filling scheme, where spans of text are replaced with a single mask token
  • Reports ablation experiments that replicate other pretraining schemes within the BART framework, to better measure which factors most influence end-task performance
  • Presents BART, which pre-trains a model combining Bidirectional and Auto-Regressive Transformers
  • BART contains roughly 10% more parameters than the equivalently sized BERT model
Reference
  • Eneko Agirre, Lluís Màrquez, and Richard Wicentowski (eds.). Proceedings of the Fourth International Workshop on Semantic Evaluations (SemEval-2007). Association for Computational Linguistics, Prague, Czech Republic, June 2007.
  • Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pp. 4171–4186, Minneapolis, Minnesota, June 2019. Association for Computational Linguistics. doi: 10.18653/v1/N19-1423. URL https://www.aclweb.org/anthology/N19-1423.
  • Emily Dinan, Varvara Logacheva, Valentin Malykh, Alexander Miller, Kurt Shuster, Jack Urbanek, Douwe Kiela, Arthur Szlam, Iulian Serban, Ryan Lowe, et al. The second conversational intelligence challenge (convai2). arXiv preprint arXiv:1902.00098, 2019.
  • William B Dolan and Chris Brockett. Automatically constructing a corpus of sentential paraphrases. In Proceedings of the International Workshop on Paraphrasing, 2005.
  • Li Dong, Nan Yang, Wenhui Wang, Furu Wei, Xiaodong Liu, Yu Wang, Jianfeng Gao, Ming Zhou, and Hsiao-Wuen Hon. Unified language model pretraining for natural language understanding and generation. arXiv preprint arXiv:1905.03197, 2019.
  • Sergey Edunov, Alexei Baevski, and Michael Auli. Pre-trained language model representations for language generation. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), 2019.
  • Angela Fan, David Grangier, and Michael Auli. Controllable abstractive summarization. arXiv preprint arXiv:1711.05217, 2017.
  • Angela Fan, Yacine Jernite, Ethan Perez, David Grangier, Jason Weston, and Michael Auli. Eli5: Long form question answering. arXiv preprint arXiv:1907.09190, 2019.
  • Dan Hendrycks and Kevin Gimpel. Gaussian error linear units (gelus). arXiv preprint arXiv:1606.08415, 2016.
  • Karl Moritz Hermann, Tomas Kocisky, Edward Grefenstette, Lasse Espeholt, Will Kay, Mustafa Suleyman, and Phil Blunsom. Teaching machines to read and comprehend. In Advances in neural information processing systems, pp. 1693–1701, 2015.
  • Mandar Joshi, Danqi Chen, Yinhan Liu, Daniel S Weld, Luke Zettlemoyer, and Omer Levy. Spanbert: Improving pre-training by representing and predicting spans. arXiv preprint arXiv:1907.10529, 2019.
  • Guillaume Lample and Alexis Conneau. Crosslingual language model pretraining. arXiv preprint arXiv:1901.07291, 2019.
  • Zhenzhong Lan, Mingda Chen, Sebastian Goodman, Kevin Gimpel, Piyush Sharma, and Radu Soricut. Albert: A lite bert for self-supervised learning of language representations. arXiv preprint arXiv:1909.11942, 2019.
  • Hector J Levesque, Ernest Davis, and Leora Morgenstern. The Winograd schema challenge. In AAAI Spring Symposium: Logical Formalizations of Commonsense Reasoning, volume 46, pp. 47, 2011.
  • Yang Liu and Mirella Lapata. Text summarization with pretrained encoders. arXiv preprint arXiv:1908.08345, 2019.
  • Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, and Veselin Stoyanov. Roberta: A robustly optimized bert pretraining approach. arXiv preprint arXiv:1907.11692, 2019.
  • Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean. Efficient estimation of word representations in vector space. arXiv preprint arXiv:1301.3781, 2013.
  • Shashi Narayan, Shay B Cohen, and Mirella Lapata. Don’t give me the details, just the summary! topicaware convolutional neural networks for extreme summarization. arXiv preprint arXiv:1808.08745, 2018.
  • Gabriel Pereyra, George Tucker, Jan Chorowski, Łukasz Kaiser, and Geoffrey Hinton. Regularizing neural networks by penalizing confident output distributions. arXiv preprint arXiv:1701.06548, 2017.
  • Matthew E Peters, Mark Neumann, Mohit Iyyer, Matt Gardner, Christopher Clark, Kenton Lee, and Luke Zettlemoyer. Deep contextualized word representations. arXiv preprint arXiv:1802.05365, 2018.
  • Alec Radford, Karthik Narasimhan, Tim Salimans, and Ilya Sutskever. Improving language understanding by generative pre-training. URL https://s3-us-west-2.amazonaws.com/openaiassets/researchcovers/languageunsupervised/language understanding paper.pdf, 2018.
  • Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, and Ilya Sutskever. Language models are unsupervised multitask learners. OpenAI Blog, 1(8), 2019.
  • Pranav Rajpurkar, Jian Zhang, Konstantin Lopyrev, and Percy Liang. Squad: 100,000+ questions for machine comprehension of text. arXiv preprint arXiv:1606.05250, 2016.
  • Abigail See, Peter J Liu, and Christopher D Manning. Get to the point: Summarization with pointer-generator networks. arXiv preprint arXiv:1704.04368, 2017.
  • Rico Sennrich, Barry Haddow, and Alexandra Birch. Edinburgh neural machine translation systems for WMT 16. In Proceedings of the First Conference on Machine Translation: Volume 2, Shared Task Papers, 2016.
  • Richard Socher, Alex Perelygin, Jean Wu, Jason Chuang, Christopher D Manning, Andrew Ng, and Christopher Potts. Recursive deep models for semantic compositionality over a sentiment treebank. In Proceedings of EMNLP, pp. 1631–1642, 2013.
  • Kaitao Song, Xu Tan, Tao Qin, Jianfeng Lu, and TieYan Liu. Mass: Masked sequence to sequence pretraining for language generation. In International Conference on Machine Learning, 2019.
  • Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. In Advances in neural information processing systems, pp. 5998–6008, 2017.
  • Alex Wang, Amanpreet Singh, Julian Michael, Felix Hill, Omer Levy, and Samuel R Bowman. Glue: A multi-task benchmark and analysis platform for natural language understanding. arXiv preprint arXiv:1804.07461, 2018.
  • Alex Warstadt, Amanpreet Singh, and Samuel R. Bowman. Neural network acceptability judgments. arXiv preprint 1805.12471, 2018.
  • Adina Williams, Nikita Nangia, and Samuel R Bowman. A broad-coverage challenge corpus for sentence understanding through inference. arXiv preprint arXiv:1704.05426, 2017.
  • Adina Williams, Nikita Nangia, and Samuel R. Bowman. A broad-coverage challenge corpus for sentence understanding through inference. In Proceedings of NAACL-HLT, 2018.
  • Zhilin Yang, Zihang Dai, Yiming Yang, Jaime Carbonell, Ruslan Salakhutdinov, and Quoc V Le. Xlnet: Generalized autoregressive pretraining for language understanding. arXiv preprint arXiv:1906.08237, 2019.