Pre-training via Paraphrasing

NeurIPS 2020.

Keywords:
neural machine translation, BLEU score, retrieval model, fine-tuning

Abstract:

We introduce MARGE, a pre-trained sequence-to-sequence model learned with an unsupervised multi-lingual multi-document paraphrasing objective. MARGE provides an alternative to the dominant masked language modeling paradigm, where we self-supervise the reconstruction of target text by retrieving a set of related texts (in many languages)...

Introduction
  • Variations on masked language models (MLMs) [Devlin et al., 2019; Liu et al., 2019; Yang et al., 2019b; Conneau et al., 2019; Lewis et al., 2019a; Raffel et al., 2019; Clark et al., 2020] provide highly effective self-supervision for pre-training by removing and reconstructing parts of an input text.
  • The authors train MARGE with a self-supervised reconstruction objective: for each target text, the model first retrieves a set of related texts and then conditions on them to maximize the likelihood of generating the original (see the sketch below).
  • The retrieval model's scores are used to bias the cross-attention toward the most relevant retrieved documents, which lets the retrieval model be trained jointly from the reconstruction loss.
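To make the joint retrieval-and-reconstruction idea concrete, the following is a minimal sketch (not the authors' code): a toy shared encoder scores each retrieved evidence document against the target, and those scores are added to the cross-attention logits over the evidence tokens, so that in a full model the reconstruction loss would also train the retriever. The toy encoder, shapes, and the scaling factor beta are illustrative assumptions.

```python
# Minimal sketch of a MARGE-style objective (illustrative, not the authors' code):
# relevance scores from a shared document encoder bias the decoder's
# cross-attention over tokens of the retrieved evidence documents.

import numpy as np

rng = np.random.default_rng(0)
d = 16                       # hidden size (toy)
beta = 1.0                   # weight on the retrieval-score bias (assumption)

def embed_doc(tokens):
    """Stand-in document encoder: mean of token vectors, L2-normalised."""
    v = tokens.mean(axis=0)
    return v / np.linalg.norm(v)

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

# A target document x and M retrieved evidence documents z_1..z_M,
# each represented as a matrix of token vectors (toy random features).
target = rng.normal(size=(8, d))
evidence = [rng.normal(size=(L, d)) for L in (10, 7, 12)]

# Relevance f(x, z_j): cosine similarity of document embeddings.
x_emb = embed_doc(target)
rel = np.array([x_emb @ embed_doc(z) for z in evidence])

# Cross-attention from each target position to all evidence tokens,
# with each source document's relevance score added to the logits.
keys = np.concatenate(evidence, axis=0)                      # (sum_L, d)
doc_bias = np.concatenate([np.full(len(z), beta * r)
                           for z, r in zip(evidence, rel)])  # (sum_L,)
logits = target @ keys.T / np.sqrt(d) + doc_bias             # (8, sum_L)
attn = softmax(logits, axis=-1)

# `attn @ keys` would feed the decoder that reconstructs the target; in the
# full model, the reconstruction loss back-propagates through this bias into
# the retrieval encoder, training retrieval and generation jointly.
context = attn @ keys
print(context.shape)  # (8, 16)
```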
Highlights
  • Variations on masked language models (MLMs) [Devlin et al., 2019; Liu et al., 2019; Yang et al., 2019b; Conneau et al., 2019; Lewis et al., 2019a; Raffel et al., 2019; Clark et al., 2020] provide highly effective self-supervision for pre-training by removing and reconstructing parts of an input text.
  • We focus on document-level translation tasks and report document-level BLEU scores (a small illustration of document-level BLEU follows these highlights).
  • The pre-training task is more closely related to downstream tasks than masked language modeling, allowing pre-trained models to achieve BLEU scores as high as 35.8 for document-level translation.
  • A further adjustment is applied to each document to be used as a target, which the authors found improved performance during development.
  • We introduced a new approach to pre-training models for natural language understanding and generation, by using retrieved documents to reconstruct the original
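As a small, hedged illustration of document-level BLEU (not necessarily the paper's exact evaluation script): each hypothesis and reference is a whole document, so n-grams are counted across sentence boundaries within it. sacreBLEU [Post, 2018] is assumed to be installed, and the example documents below are made up.

```python
# Document-level BLEU sketch: score whole documents as single segments.
import sacrebleu

hyp_docs = [
    "The committee met on Tuesday . It approved the new budget .",
    "Exports rose sharply last quarter .",
]
ref_docs = [
    "The committee convened on Tuesday . It approved the new budget .",
    "Exports increased sharply in the last quarter .",
]

# Each document is one segment, so n-grams can span sentence boundaries
# within a document -- the usual way document-level BLEU is reported.
bleu = sacrebleu.corpus_bleu(hyp_docs, [ref_docs])
print(f"Document-level BLEU: {bleu.score:.1f}")
```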
Methods
  • As a multi-lingual sequence-to-sequence model, MARGE is applicable to a very broad range of tasks.
  • The authors focus on multi-lingual tasks with elements of retrieval, document comprehension, and document generation, because they are the most directly related to the pre-training.
  • On Tatoeba, there is significant variation across languages, but overall MARGE performs comparably to XLM-R and significantly better than other pre-trained models.
  • Better results have been achieved on both tasks using labeled bitext for training [Artetxe and Schwenk, 2019], but the results suggest that the pre-training objective learns an effective cross-lingual retrieval function
Results
  • The authors explore zero-shot summarization, where the model is trained on all languages except the test language—this model outperforms a strong lead-3 baseline, and even a supervised pointer-generator model on Spanish and Russian.
  • On this domain, the authors achieve better results with MARGE-NEWS, a version of the model trained only on news.
Conclusion
  • MARGE shows strong performance on a wider range of tasks than any previous pre-trained model, and is effective at discriminative and generative tasks in many languages.
  • MARGE exhibits strong performance on a range of discriminative and generative tasks in many languages, both with and without fine-tuning.
  • These results establish MARGE as a viable alternative to masked language modeling and provide a step towards pre-trained models that can perform any task with little or no fine-tuning.
  • Future work should scale MARGE to more domains and languages, and study how to more closely align pre-training objectives with different end tasks.
Summary
  • Introduction:

    Variations on masked language models (MLMs) [Devlin et al., 2019; Liu et al., 2019; Yang et al., 2019b; Conneau et al., 2019; Lewis et al., 2019a; Raffel et al., 2019; Clark et al., 2020] provide highly effective self-supervision for pre-training by removing and reconstructing parts of an input text.
  • The authors train MARGE with a self-supervised reconstruction objective: for each target text, the model first retrieves a set of related texts and then conditions on them to maximize the likelihood of generating the original.
  • The retrieval model's scores are used to bias the cross-attention toward the most relevant retrieved documents, which lets the retrieval model be trained jointly from the reconstruction loss.
  • Objectives:

    Batching: The authors aim to construct batches containing clusters of related target and evidence documents, to maximize the available information for reconstructing each target (a toy sketch of this style of batching follows the summary).
  • Methods:

    As a multi-lingual sequence-to-sequence model, MARGE is applicable to a very broad range of tasks.
  • The authors focus on multi-lingual tasks with elements of retrieval, document comprehension, and document generation, because they are the most directly related to the pre-training.
  • On Tatoeba, there is significant variation across languages, but overall MARGE performs comparably to XLM-R and significantly better than other pre-trained models.
  • Better results have been achieved on both tasks using labeled bitext for training [Artetxe and Schwenk, 2019], but the results suggest that the pre-training objective learns an effective cross-lingual retrieval function
  • Results:

    The authors explore zero-shot summarization, where the model is trained on all languages except the test language—this model outperforms a strong lead-3 baseline, and even a supervised pointer-generator model on Spanish and Russian.
  • On this domain, the authors achieve better results with MARGE-NEWS, a version of the model trained only on news.
  • Conclusion:

    MARGE shows strong performance on a wider range of tasks than any previous pre-trained model, and is effective at discriminative and generative tasks in many languages.
  • MARGE exhibits strong performance on a range of discriminative and generative tasks in many languages, both with and without fine-tuning.
  • These results establish MARGE as a viable alternative to masked language modeling and provide a step towards pre-trained models that can perform any task with little or no fine-tuning.
  • Future work should scale MARGE to more domains and languages, and study how to more closely align pre-training objectives with different end tasks.
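As referenced in the batching bullet above, here is a toy sketch of cluster-based batching. It is not the authors' pipeline (which retrieves evidence with a learned retrieval model over document embeddings); the greedy nearest-prototype grouping, cluster count, and k below are illustrative assumptions. The idea is that each batch is built from one cluster of mutually related documents, so every target has its evidence available in-batch.

```python
# Toy cluster-based batching: group related documents, then pair each
# target in a cluster with its top-k in-cluster evidence documents.
import numpy as np
from collections import defaultdict

rng = np.random.default_rng(0)

def doc_embedding(doc_vectors):
    v = doc_vectors.mean(axis=0)
    return v / np.linalg.norm(v)

# Toy corpus: 40 documents of random token vectors.
docs = [rng.normal(size=(rng.integers(5, 15), 16)) for _ in range(40)]
embs = np.stack([doc_embedding(doc) for doc in docs])

# Greedy grouping by nearest prototype (stand-in for real clustering).
n_clusters, k_evidence = 5, 4
protos = embs[rng.choice(len(docs), n_clusters, replace=False)]
clusters = defaultdict(list)
for i, e in enumerate(embs):
    clusters[int(np.argmax(protos @ e))].append(i)

def make_batch(cluster_ids):
    """Pair each target in the cluster with its top-k in-cluster evidence."""
    batch = []
    for t in cluster_ids:
        sims = embs[cluster_ids] @ embs[t]
        order = [cluster_ids[j] for j in np.argsort(-sims) if cluster_ids[j] != t]
        batch.append((t, order[:k_evidence]))
    return batch

for cid, members in clusters.items():
    if len(members) > 1:
        print(cid, make_batch(members)[:2])
```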
Tables
  • Table1: Comparison models: MARGE is pre-trained on a scale between XLM and XLM-R
  • Table2: Zero-shot unsupervised document level machine translation BLEU scores using the pre-trained model, with no fine-tuning or special constraints on generation. Performance varies considerably across languages, but is non-trivial with even distantly related languages
  • Table3: Unsupervised Sentence Retrieval results on BUCC. MARGE outperforms other unsupervised models
  • Table4: Supervised document-level machine translation. Comparison results are from Liu et al [2020]. MARGE performs similarly to mBART
  • Table5: ROUGE-L scores on MLSum. MARGE generates abstractive summaries that outperform an extractive mBERT model. We also demonstrate zero-shot transfer, where the model is trained on all languages except the test language, alongside results from training on all languages
  • Table6: Cross-lingual transfer: models are trained on English (en) and tested on other languages. MARGE performs competitively with XLM-R, with 20% of the pre-training compute
  • Table7: Example zero-shot unsupervised inputs and outputs (truncated for clarity)
  • Table8: Tatoeba zero-shot sentence retrieval results. MARGE performs comparably to XLM-R, but with significant variation across languages. We only show results for languages in all models’ pre-training data
  • Table9: Number of documents per language used for pre-training. Languages represent a range of families and geographical regions. The Germanic, Hellenic, Romance, Slavic, and Indo-Iranian families are part of a broader Indo-European family
Related work
  • NLP pre-training: Since BERT [Devlin et al., 2019], pre-training for NLP has been dominated by variants of masked language models. For example, Yang et al. [2019b] predict the masked tokens auto-regressively, Dong et al. [2019] multitask MLM and language modeling objectives, Clark et al. [2020] train a discriminator to classify the correctness of MLM samples, and Lewis et al. [2019a] and Raffel et al. [2019] use seq2seq models with masked inputs. MARGE departs significantly from these objectives in that its pre-training inputs are complete, uncorrupted text.

    Bitext mining: Recent work has shown impressive results on machine translation through bitext mining [Schwenk et al., 2019], in which a retrieval model is used to search for parallel sentences in a large multilingual corpus, which are then used as training data for a machine translation model. A key conceptual difference is that literal bitext is not optimal for our approach, as we hope to learn linguistic information by training on noisy document-level paraphrases. We also learn to retrieve and translate with no manually translated sentences, unlike existing bitext mining methods.
References
  • Mikel Artetxe and Holger Schwenk. Massively multilingual sentence embeddings for zero-shot cross-lingual transfer and beyond. Transactions of the Association for Computational Linguistics, 7:597–610, 2019.
  • Mikel Artetxe, Gorka Labaka, Eneko Agirre, and Kyunghyun Cho. Unsupervised neural machine translation. arXiv preprint arXiv:1710.11041, 2017.
  • Kevin Clark, Minh-Thang Luong, Quoc V Le, and Christopher D Manning. Electra: Pre-training text encoders as discriminators rather than generators. arXiv preprint arXiv:2003.10555, 2020.
  • Alexis Conneau, Kartikay Khandelwal, Naman Goyal, Vishrav Chaudhary, Guillaume Wenzek, Francisco Guzmán, Edouard Grave, Myle Ott, Luke Zettlemoyer, and Veselin Stoyanov. Unsupervised cross-lingual representation learning at scale. arXiv preprint arXiv:1911.02116, 2019.
  • Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 4171–4186, Minneapolis, Minnesota, June 2019. Association for Computational Linguistics. doi: 10.18653/v1/N19-1423. URL https://www.aclweb.org/anthology/N19-1423.
  • Li Dong, Nan Yang, Wenhui Wang, Furu Wei, Xiaodong Liu, Yu Wang, Jianfeng Gao, Ming Zhou, and Hsiao-Wuen Hon. Unified language model pre-training for natural language understanding and generation. arXiv preprint arXiv:1905.03197, 2019.
  • Angela Fan, David Grangier, and Michael Auli. Controllable abstractive summarization. arXiv preprint arXiv:1711.05217, 2017.
  • Kelvin Guu, Tatsunori B Hashimoto, Yonatan Oren, and Percy Liang. Generating sentences by editing prototypes. Transactions of the Association for Computational Linguistics, 6:437–450, 2018.
  • Kelvin Guu, Kenton Lee, Zora Tung, Panupong Pasupat, and Ming-Wei Chang. REALM: Retrieval-augmented language model pre-training. arXiv preprint arXiv:2002.08909, 2020.
  • Junjie Hu, Sebastian Ruder, Aditya Siddhant, Graham Neubig, Orhan Firat, and Melvin Johnson. Xtreme: A massively multilingual multi-task benchmark for evaluating cross-lingual generalization. arXiv preprint arXiv:2003.11080, 2020.
  • Jeff Johnson, Matthijs Douze, and Hervé Jégou. Billion-scale similarity search with gpus. IEEE Transactions on Big Data, 2019.
  • Melvin Johnson, Mike Schuster, Quoc V Le, Maxim Krikun, Yonghui Wu, Zhifeng Chen, Nikhil Thorat, Fernanda Viégas, Martin Wattenberg, Greg Corrado, et al. Google’s multilingual neural machine translation system: Enabling zero-shot translation. Transactions of the Association for Computational Linguistics, 5:339–351, 2017.
  • Armand Joulin, Edouard Grave, Piotr Bojanowski, Matthijs Douze, Hervé Jégou, and Tomas Mikolov. FastText.zip: Compressing text classification models. arXiv preprint arXiv:1612.03651, 2016.
  • Jared Kaplan, Sam McCandlish, Tom Henighan, Tom B Brown, Benjamin Chess, Rewon Child, Scott Gray, Alec Radford, Jeffrey Wu, and Dario Amodei. Scaling laws for neural language models. arXiv preprint arXiv:2001.08361, 2020.
  • Urvashi Khandelwal, Omer Levy, Dan Jurafsky, Luke Zettlemoyer, and Mike Lewis. Generalization through memorization: Nearest neighbor language models. arXiv preprint arXiv:1911.00172, 2019.
  • Guillaume Lample and Alexis Conneau. Cross-lingual language model pretraining. arXiv preprint arXiv:1901.07291, 2019.
  • Guillaume Lample, Alexis Conneau, Ludovic Denoyer, and Marc’Aurelio Ranzato. Unsupervised machine translation using monolingual corpora only. arXiv preprint arXiv:1711.00043, 2017.
  • Mike Lewis, Yinhan Liu, Naman Goyal, Marjan Ghazvininejad, Abdelrahman Mohamed, Omer Levy, Ves Stoyanov, and Luke Zettlemoyer. Bart: Denoising sequence-to-sequence pre-training for natural language generation, translation, and comprehension. arXiv preprint arXiv:1910.13461, 2019a.
  • Patrick Lewis, Barlas Oguz, Ruty Rinott, Sebastian Riedel, and Holger Schwenk. Mlqa: Evaluating cross-lingual extractive question answering. arXiv preprint arXiv:1910.07475, 2019b.
  • Patrick Lewis, Ethan Perez, Aleksandara Piktus, Fabio Petroni, Vladimir Karpukhin, Naman Goyal, Heinrich Küttler, Mike Lewis, Wen-tau Yih, Tim Rocktäschel, et al. Retrieval-augmented generation for knowledge-intensive nlp tasks. arXiv preprint arXiv:2005.11401, 2020.
  • Zhuohan Li, Eric Wallace, Sheng Shen, Kevin Lin, Kurt Keutzer, Dan Klein, and Joseph E Gonzalez. Train large, then compress: Rethinking model size for efficient training and inference of transformers. arXiv preprint arXiv:2002.11794, 2020.
  • Peter J Liu, Mohammad Saleh, Etienne Pot, Ben Goodrich, Ryan Sepassi, Lukasz Kaiser, and Noam Shazeer. Generating wikipedia by summarizing long sequences. arXiv preprint arXiv:1801.10198, 2018.
  • Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, and Veselin Stoyanov. Roberta: A robustly optimized bert pretraining approach. arXiv preprint arXiv:1907.11692, 2019.
  • Yinhan Liu, Jiatao Gu, Naman Goyal, Xian Li, Sergey Edunov, Marjan Ghazvininejad, Mike Lewis, and Luke Zettlemoyer. Multilingual denoising pre-training for neural machine translation. arXiv preprint arXiv:2001.08210, 2020.
  • Bryan McCann, James Bradbury, Caiming Xiong, and Richard Socher. Learned in translation: Contextualized word vectors. In Advances in Neural Information Processing Systems, pages 6294–6305, 2017.
  • Lesly Miculicich, Dhananjay Ram, Nikolaos Pappas, and James Henderson. Document-level neural machine translation with hierarchical attention networks. arXiv preprint arXiv:1809.01576, 2018.
  • Matt Post. A call for clarity in reporting bleu scores. arXiv preprint arXiv:1804.08771, 2018.
  • Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J Liu. Exploring the limits of transfer learning with a unified text-to-text transformer. arXiv preprint arXiv:1910.10683, 2019.
  • Pranav Rajpurkar, Jian Zhang, Konstantin Lopyrev, and Percy Liang. Squad: 100,000+ questions for machine comprehension of text. arXiv preprint arXiv:1606.05250, 2016.
  • Anna Rogers, Olga Kovaleva, and Anna Rumshisky. A primer in bertology: What we know about how bert works. arXiv preprint arXiv:2002.12327, 2020.
  • Holger Schwenk, Guillaume Wenzek, Sergey Edunov, Edouard Grave, and Armand Joulin. Ccmatrix: Mining billions of high-quality parallel sentences on the web. arXiv preprint arXiv:1911.04944, 2019.
  • Thomas Scialom, Paul-Alexis Dray, Sylvain Lamprier, Benjamin Piwowarski, and Jacopo Staiano. Mlsum: The multilingual summarization corpus. arXiv preprint arXiv:2004.14900, 2020.
  • Aditya Siddhant, Melvin Johnson, Henry Tsai, Naveen Arivazhagan, Jason Riesa, Ankur Bapna, Orhan Firat, and Karthik Raman. Evaluating the cross-lingual effectiveness of massively multilingual neural machine translation. arXiv preprint arXiv:1909.00437, 2019.
  • Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. In Advances in neural information processing systems, pages 5998–6008, 2017.
  • John Wieting and Douwe Kiela. No training required: Exploring random encoders for sentence classification. arXiv preprint arXiv:1901.10444, 2019.
  • Yinfei Yang, Yuan Zhang, Chris Tar, and Jason Baldridge. Paws-x: A cross-lingual adversarial dataset for paraphrase identification. arXiv preprint arXiv:1908.11828, 2019a.
  • Zhilin Yang, Zihang Dai, Yiming Yang, Jaime Carbonell, Ruslan Salakhutdinov, and Quoc V Le. Xlnet: Generalized autoregressive pretraining for language understanding. arXiv preprint arXiv:1906.08237, 2019b.
  • Pierre Zweigenbaum, Serge Sharoff, and Reinhard Rapp. Overview of the third bucc shared task: Spotting parallel sentences in comparable corpora. In Proceedings of 11th Workshop on Building and Using Comparable Corpora, pages 39–42, 2018.