MASS: Masked Sequence to Sequence Pre-training for Language Generation

International Conference on Machine Learning (ICML), 2019.

Keywords:
causal language model, language generation task, encoder decoder, computational linguistics, low resource

Abstract:

Pre-training and fine-tuning, e.g., BERT, have achieved great success in language understanding by transferring knowledge from a rich-resource pre-training task to low/zero-resource downstream tasks. Inspired by the success of BERT, we propose MAsked Sequence to Sequence pre-training (MASS) for encoder-decoder based language generation tasks. ...

Introduction
  • Pre-training and fine-tuning are widely used when target tasks have low or zero resource in terms of training data, while the pre-training task has plenty of data (Girshick et al., 2014; Szegedy et al., 2015; Ouyang et al., 2015; Dai & Le, 2015; Howard & Ruder, 2018; Radford et al., 2018; Devlin et al., 2018).
  • Language generation tasks are usually data-hungry, and many of them are low-resource or even zero-resource in terms of training data.
  • How to design pre-training methods for language generation tasks is therefore of great potential and importance.
Highlights
  • Pre-training and fine-tuning are widely used when target tasks have low or zero resource in terms of training data, while the pre-training task has plenty of data (Girshick et al., 2014; Szegedy et al., 2015; Ouyang et al., 2015; Dai & Le, 2015; Howard & Ruder, 2018; Radford et al., 2018; Devlin et al., 2018)
  • Different from language understanding, language generation aims to generate natural language sentences conditioned on some input, and includes tasks such as neural machine translation (NMT) (Cho et al., 2014; Bahdanau et al., 2015a; Vaswani et al., 2017), text summarization (Ayana et al., 2016; Suzuki & Nagata, 2017; Gehring et al., 2017) and conversational response generation (Shang et al., 2015; Vinyals & Le, 2015)
  • To distinguish between the source and target languages in the neural machine translation task, we add a language embedding to each token of the input sentence for both the encoder and the decoder, which is learned end-to-end
  • We further compare MASS with the pre-training methods BERT+LM and DAE described in Section 4.2, using 100K training pairs on the text summarization task
  • The length k of the masked fragment is an important hyperparameter of MASS, and we vary k in Section 3.2 to cover the special cases of masked language modeling in BERT and standard language modeling
  • We have proposed MASS, masked sequence to sequence pre-training for language generation tasks, which reconstructs a sentence fragment given the remaining part of the sentence in the encoder-decoder framework (a worked form of this objective is given in the sketch after this list)
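As a worked form of the reconstruction objective in the last bullet, the loss below is one way to write masked sequence to sequence pre-training as a log-likelihood. The notation is introduced here for illustration and is not taken from this summary: x is a sentence from the monolingual corpus X, x^{u:v} is the fragment from position u to v, x^{\u:v} is the sentence with that fragment masked out, and k = v - u + 1 is the fragment length.

```latex
% MASS: predict the masked fragment x^{u:v} from the masked sentence x^{\setminus u:v},
% factorized left-to-right over the fragment by the decoder.
L(\theta; \mathcal{X})
  = \frac{1}{|\mathcal{X}|} \sum_{x \in \mathcal{X}} \log P\!\left(x^{u:v} \mid x^{\setminus u:v}; \theta\right)
  = \frac{1}{|\mathcal{X}|} \sum_{x \in \mathcal{X}} \sum_{t=u}^{v} \log P\!\left(x^{u:v}_{t} \mid x^{u:v}_{<t},\; x^{\setminus u:v}; \theta\right)
```

With k = 1 this reduces to predicting a single masked token (BERT-style masked language modeling); with k = m, where m is the sentence length, the encoder input is fully masked and the inner sum becomes a standard language modeling objective.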
Methods
  • Table 2 setting: the unsupervised NMT baselines are Artetxe et al. (2017) with a 2-layer RNN, Lample et al. (2017) with a 3-layer RNN, Yang et al. (2018) and Lample et al. (2018) with 4-layer Transformers, and XLM (Lample & Conneau, 2019) and MASS with 6-layer Transformers, compared on en-fr, fr-en, en-de, de-en, en-ro and ro-en.
  • The BERT+LM baseline uses masked language modeling in BERT to pre-train the encoder and standard language modeling to pre-train the decoder.
  • The authors pre-train the model with BERT+LM and DAE, and fine-tune on the unsupervised translation pairs with the same fine-tuning strategy as XLM (i.e., DAE loss + back-translation).
  • These methods are configured with the 6-layer Transformer setting.
  • Neither k = 1 nor k = m achieves good performance on the downstream language generation tasks, as shown in Figure 5 (the sketch after this list illustrates how these two extremes degenerate to BERT-style masked language modeling and standard language modeling).
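To make the two extremes in the last bullet concrete, here is a minimal, illustrative sketch (not the authors' released code) of building a MASS-style training example. The function name make_mass_example and the [M] mask symbol are hypothetical, and the detail that MASS also masks the decoder input at positions outside the fragment is omitted for brevity.

```python
import random

MASK = "[M]"  # hypothetical mask symbol, for illustration only

def make_mass_example(tokens, k, start=None):
    """Build (encoder_input, decoder_input, decoder_target) for one sentence.

    A fragment of k consecutive tokens is replaced by mask symbols on the
    encoder side; the decoder is trained to predict exactly that fragment,
    conditioned on the masked sentence and on its own previous tokens.
    """
    m = len(tokens)
    assert 1 <= k <= m
    u = random.randrange(m - k + 1) if start is None else start  # fragment start
    fragment = tokens[u:u + k]

    encoder_input = tokens[:u] + [MASK] * k + tokens[u + k:]
    decoder_input = [MASK] + fragment[:-1]   # fragment shifted right (teacher forcing)
    decoder_target = fragment
    return encoder_input, decoder_input, decoder_target

sentence = "we propose masked sequence to sequence pre-training".split()
m = len(sentence)

# k = 1: a single token is masked, close to BERT's masked language modeling.
print(make_mass_example(sentence, k=1, start=3))

# k = m: the whole sentence is masked, so the encoder carries no token
# information and the decoder behaves like a standard left-to-right language model.
print(make_mass_example(sentence, k=m, start=0))
```

The shifted decoder input is the usual teacher-forcing convention for sequence to sequence training; the two printed examples correspond to the degenerate cases the bullet above refers to.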
Results
  • The authors describe the experimental details of MASS pre-training and fine-tuning on a variety of language generation tasks, including NMT, text summarization, and conversational response generation.
  • The authors study the performance of MASS with different k, choosing k from 10% to 90% of the sentence length m with a step size of 10%, plus the special cases k = 1 and k = m (the candidate values are enumerated in the sketch after this list)
  • The authors observe both the performance of MASS after pre-training and the performance after fine-tuning on several language generation tasks, including unsupervised English-French translation, text summarization and conversational response generation.
  • The authors show the curve of the validation BLEU scores on unsupervised En-Fr translation in Figure 5c and the validation ROUGE scores on text summarization.
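As a small aid to the sweep described above, the sketch below enumerates the candidate fragment lengths for a sentence of length m. The function name candidate_ks, the rounding rule, and the clipping to at least 1 are illustrative assumptions rather than details stated in the paper.

```python
def candidate_ks(m):
    """Fragment lengths swept in the study for a sentence of length m:
    10% to 90% of m in steps of 10%, plus the special cases k = 1 and k = m."""
    ks = {max(1, round(p / 100 * m)) for p in range(10, 100, 10)}
    ks.update({1, m})
    return sorted(ks)

print(candidate_ks(20))  # [1, 2, 4, 6, 8, 10, 12, 14, 16, 18, 20]
```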
Conclusion
  • The authors have proposed MASS: masked sequence to sequence pre-training for language generation tasks, which reconstructs a sentence fragment given the remaining part of the sentence in the encoder-decoder framework.
  • Through experiments on the three tasks above and a total of eight datasets, MASS achieved significant improvements over baselines without pre-training or with other pre-training methods.
  • The authors will apply MASS to more language generation tasks such as sentence paraphrasing, text style transfer and post-editing.
  • The authors will further investigate the theoretical and empirical analysis of the masked sequence to sequence pre-training method.
Tables
  • Table 1: Masked language modeling in BERT and standard language modeling, as special cases covered in MASS
  • Table 2: The BLEU score comparisons between MASS and the previous works on unsupervised NMT. Results on en-fr and fr-en pairs are reported on newstest2014 and the others on newstest2016. Since XLM uses different combinations of MLM and CLM in the encoder and decoder, we report the highest BLEU score for XLM on each language pair
  • Table 3: The BLEU score comparisons between MASS and other pre-training methods. The results for BERT+LM are taken directly from the MLM+CLM setting in XLM (Lample & Conneau, 2019), as they use the same pre-training methods
  • Table 4: The comparisons between MASS and two other pre-training methods in terms of ROUGE score on the text summarization task with 100K training data
  • Table 5: The comparisons between MASS and other baseline methods in terms of PPL on the Cornell Movie Dialog corpus
  • Table 6: The comparison between MASS and the ablation methods in terms of BLEU score on the unsupervised en-fr translation
Related work
  • There is a large body of work on sequence to sequence learning and on pre-training for natural language processing. We briefly review several popular approaches in this section.

    2.1. Sequence to Sequence Learning

    Sequence to sequence learning (Cho et al., 2014; Bahdanau et al., 2015a; Wu et al., 2016; Gehring et al., 2017; Vaswani et al., 2017) is a challenging task in artificial intelligence, and covers a variety of language generation applications such as NMT (Cho et al., 2014; Bahdanau et al., 2015a; Wu et al., 2016; Gehring et al., 2017; Vaswani et al., 2017; Tan et al., 2019; Artetxe et al., 2017; Lample et al., 2017; 2018; He et al., 2018; Hassan et al., 2018; Song et al., 2018; Shen et al., 2018), text summarization (Ayana et al., 2016; Suzuki & Nagata, 2017; Gehring et al., 2017), question answering (Yuan et al., 2017; Fedus et al., 2018) and conversational response generation (Shang et al., 2015; Vinyals & Le, 2015).

    Sequence to sequence learning has attracted much attention in recent years due to the advance of deep learning. However, many language generation tasks such as NMT lack paired data but have plenty of unpaired data. Therefore, pre-training on unpaired data and fine-tuning with small-scale paired data will be helpful for these tasks, which is exactly the focus of this work.
Funding
  • This work was partially supported by the National Key Research and Development Program of China under Grant 2018YFB1004904
Reference
  • Artetxe, M., Labaka, G., Agirre, E., and Cho, K. Unsupervised neural machine translation. CoRR, 2017.
  • Ayana, Shen, S., Liu, Z., and Sun, M. Neural headline generation with minimum risk training. ArXiv, 2016.
  • Bahdanau, D., Cho, K., and Bengio, Y. Neural machine translation by jointly learning to align and translate. In ICLR, 2015.
  • Bengio, Y., Ducharme, R., Vincent, P., and Jauvin, C. A neural probabilistic language model. Journal of Machine Learning Research, 3(Feb):1137–1155, 2003.
  • Blitzer, J., McDonald, R., and Pereira, F. Domain adaptation with structural correspondence learning. In EMNLP, pp. 120–128. Association for Computational Linguistics, 2006.
  • Bowman, S. R., Angeli, G., Potts, C., and Manning, C. D. A large annotated corpus for learning natural language inference. In EMNLP, 2015.
  • Brown, P. F., Desouza, P. V., Mercer, R. L., Pietra, V. J. D., and Lai, J. C. Class-based n-gram models of natural language. Computational Linguistics, 18(4):467–479, 1992.
  • Cho, K., van Merrienboer, B., Gulcehre, C., Bahdanau, D., Bougares, F., Schwenk, H., and Bengio, Y. Learning phrase representations using RNN encoder-decoder for statistical machine translation. In EMNLP, 2014.
  • Collobert, R. and Weston, J. A unified architecture for natural language processing: Deep neural networks with multitask learning. In ICML, pp. 160–167. ACM, 2008.
  • Dai, A. M. and Le, Q. V. Semi-supervised sequence learning. In NIPS, pp. 3079–3087, 2015.
  • Danescu-Niculescu-Mizil, C. and Lee, L. Chameleons in imagined conversations: A new approach to understanding coordination of linguistic style in dialogs. In ACL Workshop on Cognitive Modeling and Computational Linguistics, 2011.
  • Devlin, J., Chang, M.-W., Lee, K., and Toutanova, K. BERT: Pre-training of deep bidirectional transformers for language understanding. CoRR, 2018.
  • Firat, O., Sankaran, B., Al-Onaizan, Y., Vural, F. T. Y., and Cho, K. Zero-resource translation with multi-lingual neural machine translation. In EMNLP, pp. 268–277, 2016.
  • Gehring, J., Auli, M., Grangier, D., Yarats, D., and Dauphin, Y. N. Convolutional sequence to sequence learning. In ICML, volume 70, pp. 1243–1252, 2017.
  • Girshick, R., Donahue, J., Darrell, T., and Malik, J. Rich feature hierarchies for accurate object detection and semantic segmentation. In CVPR, pp. 580–587, 2014.
  • Graff, D., et al. English Gigaword. Linguistic Data Consortium, 2003.
  • Hassan, H., Aue, A., Chen, C., Chowdhary, V., Clark, J., Federmann, C., Huang, X., Junczys-Dowmunt, M., Lewis, W., Li, M., et al. Achieving human parity on automatic Chinese to English news translation. arXiv preprint arXiv:1803.05567, 2018.
  • He, K., Zhang, X., Ren, S., and Sun, J. Deep residual learning for image recognition. In CVPR, pp. 770–778, 2016.
  • He, T., Tan, X., Xia, Y., He, D., Qin, T., Chen, Z., and Liu, T.-Y. Layer-wise coordination between encoder and decoder for neural machine translation. In Advances in Neural Information Processing Systems, pp. 7944–7954, 2018.
  • Howard, J. and Ruder, S. Universal language model fine-tuning for text classification. In ACL, volume 1, pp. 328–339, 2018.
  • Kingma, D. P. and Ba, J. Adam: A method for stochastic optimization. In ICLR, 2015.
  • Kiros, R., Zhu, Y., Salakhutdinov, R. R., Zemel, R., Urtasun, R., Torralba, A., and Fidler, S. Skip-thought vectors. In NIPS, pp. 3294–3302, 2015.
  • Lample, G. and Conneau, A. Cross-lingual language model pretraining. CoRR, abs/1901.07291, 2019.
  • Lample, G., Conneau, A., Denoyer, L., and Ranzato, M. Unsupervised machine translation using monolingual corpora only. CoRR, 2017.
  • Lample, G., Ott, M., Conneau, A., Denoyer, L., and Ranzato, M. Phrase-based & neural unsupervised machine translation. In EMNLP, pp. 5039–5049, 2018.
  • Le, Q. and Mikolov, T. Distributed representations of sentences and documents. In ICML, pp. 1188–1196, 2014.
  • Logeswaran, L. and Lee, H. An efficient framework for learning sentence representations. CoRR, 2018.
  • McCann, B., Bradbury, J., Xiong, C., and Socher, R. Learned in translation: Contextualized word vectors. In NIPS, pp. 6294–6305, 2017.
  • Mikolov, T., Karafiát, M., Burget, L., Černocký, J., and Khudanpur, S. Recurrent neural network based language model. In Eleventh Annual Conference of the International Speech Communication Association, 2010.
  • Mikolov, T., Sutskever, I., Chen, K., Corrado, G. S., and Dean, J. Distributed representations of words and phrases and their compositionality. In NIPS, pp. 3111–3119, 2013.
  • Peters, M., Neumann, M., Iyyer, M., Gardner, M., Clark, C., Lee, K., and Zettlemoyer, L. Deep contextualized word representations. In NAACL, volume 1, pp. 2227–2237, 2018.
  • Radford, A., Narasimhan, K., Salimans, T., and Sutskever, I. Improving language understanding by generative pre-training. 2018.
  • Rajpurkar, P., Zhang, J., Lopyrev, K., and Liang, P. SQuAD: 100,000+ questions for machine comprehension of text. CoRR, 2016.
  • Ramachandran, P., Liu, P. J., and Le, Q. V. Unsupervised pretraining for sequence to sequence learning. CoRR, abs/1611.02683, 2016.
  • Sennrich, R., Haddow, B., and Birch, A. Neural machine translation of rare words with subword units. In ACL, volume 1, pp. 1715–1725, 2016.
  • Shang, L., Lu, Z., and Li, H. Neural responding machine for short-text conversation. In ACL, volume 1, pp. 1577–1586, 2015.
  • Shen, Y., Tan, X., He, D., Qin, T., and Liu, T.-Y. Dense information flow for neural machine translation. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), pp. 1294–1303, June 2018.
  • Socher, R., Perelygin, A., Wu, J., Chuang, J., Manning, C. D., Ng, A., and Potts, C. Recursive deep models for semantic compositionality over a sentiment treebank. In EMNLP, pp. 1631–1642, 2013.
  • Song, K., Tan, X., He, D., Lu, J., Qin, T., and Liu, T.-Y. Double path networks for sequence to sequence learning. In Proceedings of the 27th International Conference on Computational Linguistics, pp. 3064–3074, 2018.
  • Suzuki, J. and Nagata, M. Cutting-off redundant repeating generations for neural abstractive summarization. In ACL, pp. 291–297, 2017.
  • Szegedy, C., Liu, W., Jia, Y., Sermanet, P., Reed, S., Anguelov, D., Erhan, D., Vanhoucke, V., and Rabinovich, A. Going deeper with convolutions. In CVPR, pp. 1–9, 2015.
  • Tjong Kim Sang, E. F. and De Meulder, F. Introduction to the CoNLL-2003 shared task: Language-independent named entity recognition. In NAACL, pp. 142–147. Association for Computational Linguistics, 2003.
  • Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, L., and Polosukhin, I. Attention is all you need. In NIPS, pp. 6000–6010, 2017.
  • Vincent, P., Larochelle, H., Bengio, Y., and Manzagol, P.-A. Extracting and composing robust features with denoising autoencoders. In ICML, pp. 1096–1103. ACM, 2008.
  • Vinyals, O. and Le, Q. V. A neural conversational model. CoRR, abs/1506.05869, 2015.
  • Wang, A., Singh, A., Michael, J., Hill, F., Levy, O., and Bowman, S. R. GLUE: A multi-task benchmark and analysis platform for natural language understanding. CoRR, abs/1804.07461, 2018.
  • Wu, Y., Schuster, M., Chen, Z., Le, Q. V., Norouzi, M., Macherey, W., Krikun, M., Cao, Y., Gao, Q., Macherey, K., Klingner, J., Shah, A., Johnson, M., Liu, X., Kaiser, L., Gouws, S., Kato, Y., Kudo, T., Kazawa, H., Stevens, K., Kurian, G., Patil, N., Wang, W., Young, C., Smith, J., Riesa, J., Rudnick, A., Vinyals, O., Corrado, G., Hughes, M., and Dean, J. Google's neural machine translation system: Bridging the gap between human and machine translation. CoRR, abs/1609.08144, 2016.
  • Yang, Z., Chen, W., Wang, F., and Xu, B. Unsupervised neural machine translation with weight sharing. In ACL, pp. 46–55, 2018.
  • Yuan, X., Wang, T., Gulcehre, C., Sordoni, A., et al. Machine comprehension by text-to-text neural question generation. In Proceedings of the 2nd Workshop on Representation Learning for NLP, pp. 15–25, 2017.