MASS: Masked Sequence to Sequence Pre-training for Language Generation
International Conference on Machine Learning, 2019.
Keywords:
causal language model; language generation task; encoder decoder; computational linguistics; low resource
Abstract:
Pre-training and fine-tuning, e.g., BERT, have achieved great success in language understanding by transferring knowledge from the rich-resource pre-training task to the low/zero-resource downstream tasks. Inspired by the success of BERT, we propose MAsked Sequence to Sequence pre-training (MASS) for the encoder-decoder based language generation tasks.
Introduction
- Pre-training and fine-tuning are widely used when target tasks are of low or zero resource in terms of training data, while pre-training has plenty of data (Girshick et al, 2014; Szegedy et al, 2015; Ouyang et al, 2015; Dai & Le, 2015; Howard & Ruder, 2018; Radford et al, 2018; Devlin et al, 2018).
- Language generation tasks are usually data-hungry, and many of them are low-resource or even zero-resource in terms of training data.
- How to design pre-training methods for language generation tasks is therefore of great importance and potential
Highlights
- Pre-training and fine-tuning are widely used when target tasks are of low or zero resource in terms of training data, while pre-training has plenty of data (Girshick et al, 2014; Szegedy et al, 2015; Ouyang et al, 2015; Dai & Le, 2015; Howard & Ruder, 2018; Radford et al, 2018; Devlin et al, 2018)
- Different from language understanding, language generation aims to generate natural language sentences conditioned on some inputs, including tasks like neural machine translation (NMT) (Cho et al, 2014; Bahdanau et al, 2015a; Vaswani et al, 2017), text summarization (Ayana et al, 2016; Suzuki & Nagata, 2017; Gehring et al, 2017) and conversational response generation (Shang et al, 2015; Vinyals & Le, 2015)
- To distinguish between the source and target languages in the neural machine translation task, we add a language embedding to each token of the input sentence for the encoder and decoder, which is learnt end-to-end
- Comparison with other pre-training methods: we further compare MASS with the BERT+LM and DAE pre-training methods described in Section 4.2, with 100K training data on the text summarization task
- Study of different k: the length of the masked fragment k is an important hyperparameter of MASS, and we vary k in Section 3.2 to cover the special cases of masked language modeling in BERT and standard language modeling
- We have proposed MASS, masked sequence to sequence pre-training for language generation tasks, which reconstructs a sentence fragment given the remaining part of the sentence in the encoder-decoder framework; a minimal sketch of the masking step follows this list
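The following is a minimal Python sketch of the fragment-masking step described above. It is illustrative only and not the authors' implementation: the function name mass_mask and the [MASK] string are placeholders, and details such as subword tokenization, the token replacement rules, and decoder-side input masking are omitted.

import random

MASK = "[MASK]"

def mass_mask(tokens, k, seed=None):
    """Mask a consecutive fragment of length k; the encoder sees the masked
    sentence and the decoder is trained to reconstruct the fragment."""
    if seed is not None:
        random.seed(seed)
    m = len(tokens)
    assert 1 <= k <= m
    u = random.randint(0, m - k)                           # start of the masked fragment
    fragment = tokens[u:u + k]                             # tokens the decoder must predict
    enc_input = tokens[:u] + [MASK] * k + tokens[u + k:]   # sentence with the fragment masked
    dec_input = [MASK] + fragment[:-1]                     # fragment shifted right (teacher forcing)
    return enc_input, dec_input, fragment

# k = 1 reduces to BERT-style masked language modeling;
# k = len(tokens) fully masks the encoder input, reducing to standard language modeling.
enc, dec, tgt = mass_mask("the quick brown fox jumps over the lazy dog".split(), k=4, seed=0)
print(enc, dec, tgt)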
Methods
- Setting (Table 2): the unsupervised NMT comparison covers six translation directions (en-fr, fr-en, en-de, de-en, en-ro, ro-en) against previous methods: Artetxe et al. (2017) (2-layer RNN), Lample et al. (2017) (3-layer RNN), Yang et al. (2018) (4-layer Transformer), Lample et al. (2018) (4-layer Transformer), and XLM (Lample & Conneau, 2019) (6-layer Transformer); MASS also uses a 6-layer Transformer.
- MASS is further compared with BERT+LM, which uses masked language modeling in BERT to pre-train the encoder and standard language modeling to pre-train the decoder, and with DAE (denoising auto-encoder). The authors pre-train the model with BERT+LM and DAE, and fine-tune on the unsupervised translation pairs with the same fine-tuning strategy as XLM (i.e., DAE loss + back-translation).
- These methods are configured with the 6-layer Transformer setting.
- Neither k = 1 nor k = m achieves good performance on the downstream language generation tasks, as shown in Figure 5; the objective behind these two special cases is written out below
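For reference, the masked sequence to sequence objective and its two extremes can be written as follows; the notation is reconstructed from the paper's description, with x^{u:v} the fragment from position u to v, x^{\u:v} the sentence with that fragment masked, and k = v - u + 1:

$$
L(\theta; \mathcal{X}) = \frac{1}{|\mathcal{X}|} \sum_{x \in \mathcal{X}} \log P\!\left(x^{u:v} \mid x^{\backslash u:v}; \theta\right)
= \frac{1}{|\mathcal{X}|} \sum_{x \in \mathcal{X}} \log \prod_{t=u}^{v} P\!\left(x^{u:v}_{t} \mid x^{u:v}_{<t},\, x^{\backslash u:v}; \theta\right)
$$

With k = 1 the decoder predicts a single masked token from the rest of the sentence, which is the masked language modeling setup in BERT; with k = m the encoder input is entirely masked and the decoder predicts the whole sentence, which is standard language modeling (the two special cases summarized in Table 1).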
Results
- The authors describe the experimental details of MASS pre-training and fine-tuning on a variety of language generation tasks, including NMT, text summarization, and conversational response generation.
- The authors study the performance of MASS with different k, choosing k from 10% to 90% of the sentence length m with a step size of 10%, plus the two extreme cases k = 1 and k = m (a sketch of this sweep follows this list)
- The authors observe both the performance of MASS after pre-training and the performance after fine-tuning on several language generation tasks, including unsupervised English-French translation, text summarization and conversational response generation.
- The authors show the curves of the validation BLEU scores on unsupervised En-Fr translation and the validation ROUGE scores on text summarization in Figure 5
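A small sketch of how the candidate fragment lengths in this sweep can be enumerated; candidate_fragment_lengths is a hypothetical helper, not part of the paper's code:

def candidate_fragment_lengths(m):
    """Fragment lengths of 10% to 90% of the sentence length m (step 10%),
    plus the two special cases k = 1 and k = m."""
    ks = {max(1, round(p * m / 100)) for p in range(10, 100, 10)}
    ks.update({1, m})
    return sorted(ks)

print(candidate_fragment_lengths(20))  # [1, 2, 4, 6, 8, 10, 12, 14, 16, 18, 20]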
Conclusion
- The authors have proposed MASS: masked sequence to sequence pre-training for language generation tasks, which reconstructs a sentence fragment given the remaining part of the sentence in the encoder-decoder framework.
- Through experiments on the three above tasks and a total of eight datasets, MASS achieved significant improvements over baselines without pre-training or with other pre-training methods.
- The authors will apply MASS to more language generation tasks such as sentence paraphrasing, text style transfer and post editing.
- The authors will conduct more theoretical and empirical analysis of the masked sequence to sequence pre-training method
Tables
- Table 1: Masked language modeling in BERT and standard language modeling, as special cases covered in MASS
- Table 2: The BLEU score comparisons between MASS and the previous works on unsupervised NMT. Results on en-fr and fr-en pairs are reported on newstest2014 and the others are on newstest2016. Since XLM uses different combinations of MLM and CLM in the encoder and decoder, we report the highest BLEU score for XLM on each language pair
- Table 3: The BLEU score comparisons between MASS and other pre-training methods. The results for BERT+LM are directly taken from the MLM+CLM setting in XLM (Lample & Conneau, 2019) as they use the same pre-training methods
- Table 4: The comparisons between MASS and two other pre-training methods in terms of ROUGE score on the text summarization task with 100K training data
- Table 5: The comparisons between MASS and other baseline methods in terms of PPL on the Cornell Movie Dialog corpus
- Table 6: The comparison between MASS and the ablation methods in terms of BLEU score on the unsupervised en-fr translation
Related work
- There is a large body of work on sequence to sequence learning and on pre-training for natural language processing. We briefly review several popular approaches in this section.
2.1. Sequence to Sequence Learning
Sequence to sequence learning (Cho et al, 2014; Bahdanau et al, 2015a; Wu et al, 2016; Gehring et al, 2017; Vaswani et al, 2017) is a challenging task in artificial intelligence, and covers a variety of language generation applications such as NMT (Cho et al, 2014; Bahdanau et al, 2015a; Wu et al, 2016; Gehring et al, 2017; Vaswani et al, 2017; Tan et al, 2019; Artetxe et al, 2017; Lample et al, 2017; 2018; He et al, 2018; Hassan et al, 2018; Song et al, 2018; Shen et al, 2018), text summarization (Ayana et al, 2016; Suzuki & Nagata, 2017; Gehring et al, 2017), question answering (Yuan et al, 2017; Fedus et al, 2018) and conversational response generation (Shang et al, 2015; Vinyals & Le, 2015).
Sequence to sequence learning has attracted much attention in recent years due to the advance of deep learning. However, many language generation tasks such as NMT lack paired data but have plenty of unpaired data. Therefore, pre-training on unpaired data and fine-tuning with small-scale paired data will be helpful for these tasks, which is exactly the focus of this work.
Funding
- This work was partially supported by the National Key Research and Development Program of China under Grant 2018YFB1004904
Reference
- Artetxe, M., Labaka, G., Agirre, E., and Cho, K. Unsupervised neural machine translation. CoRR, 2017.
- Ayana, Shen, S., Liu, Z., and Sun, M. Neural headline generation with minimum risk training. ArXiv, 2016.
- Bahdanau, D., Cho, K., and Bengio, Y. Neural machine translation by jointly learning to align and translate. In ICLR, 2015.
- Bengio, Y., Ducharme, R., Vincent, P., and Jauvin, C. A neural probabilistic language model. Journal of Machine Learning Research, 3(Feb):1137–1155, 2003.
- Blitzer, J., McDonald, R., and Pereira, F. Domain adaptation with structural correspondence learning. In EMNLP, pp. 120–128. Association for Computational Linguistics, 2006.
- Bowman, S. R., Angeli, G., Potts, C., and Manning, C. D. A large annotated corpus for learning natural language inference. In EMNLP, 2015.
- Brown, P. F., Desouza, P. V., Mercer, R. L., Pietra, V. J. D., and Lai, J. C. Class-based n-gram models of natural language. Computational Linguistics, 18(4):467–479, 1992.
- Cho, K., van Merrienboer, B., Gulcehre, C., Bahdanau, D., Bougares, F., Schwenk, H., and Bengio, Y. Learning phrase representations using RNN encoder-decoder for statistical machine translation. In EMNLP, 2014.
- Collobert, R. and Weston, J. A unified architecture for natural language processing: Deep neural networks with multitask learning. In ICML, pp. 160–167. ACM, 2008.
- Dai, A. M. and Le, Q. V. Semi-supervised sequence learning. In NIPS, pp. 3079–3087, 2015.
- Danescu-Niculescu-Mizil, C. and Lee, L. Chameleons in imagined conversations: A new approach to understanding coordination of linguistic style in dialogs. In ACL Workshop on Cognitive Modeling and Computational Linguistics, 2011.
- Devlin, J., Chang, M.-W., Lee, K., and Toutanova, K. BERT: Pre-training of deep bidirectional transformers for language understanding. CoRR, 2018.
- Firat, O., Sankaran, B., Al-Onaizan, Y., Vural, F. T. Y., and Cho, K. Zero-resource translation with multi-lingual neural machine translation. In EMNLP, pp. 268–277, 2016.
- Gehring, J., Auli, M., Grangier, D., Yarats, D., and Dauphin, Y. N. Convolutional sequence to sequence learning. In ICML, volume 70, pp. 1243–1252, 2017.
- Girshick, R., Donahue, J., Darrell, T., and Malik, J. Rich feature hierarchies for accurate object detection and semantic segmentation. In CVPR, pp. 580–587, 2014.
- Graff, D. and Cieri, C. English Gigaword. Linguistic Data Consortium, 2003.
- Hassan, H., Aue, A., Chen, C., Chowdhary, V., Clark, J., Federmann, C., Huang, X., Junczys-Dowmunt, M., Lewis, W., Li, M., et al. Achieving human parity on automatic Chinese to English news translation. arXiv preprint arXiv:1803.05567, 2018.
- He, K., Zhang, X., Ren, S., and Sun, J. Deep residual learning for image recognition. In CVPR, pp. 770–778, 2016.
- He, T., Tan, X., Xia, Y., He, D., Qin, T., Chen, Z., and Liu, T.-Y. Layer-wise coordination between encoder and decoder for neural machine translation. In Advances in Neural Information Processing Systems, pp. 7944–7954, 2018.
- Howard, J. and Ruder, S. Universal language model fine-tuning for text classification. In ACL, volume 1, pp. 328–339, 2018.
- Kingma, D. P. and Ba, J. Adam: A method for stochastic optimization. In ICLR, 2015.
- Kiros, R., Zhu, Y., Salakhutdinov, R. R., Zemel, R., Urtasun, R., Torralba, A., and Fidler, S. Skip-thought vectors. In NIPS, pp. 3294–3302, 2015.
- Lample, G. and Conneau, A. Cross-lingual language model pretraining. CoRR, abs/1901.07291, 2019.
- Lample, G., Conneau, A., Denoyer, L., and Ranzato, M. Unsupervised machine translation using monolingual corpora only. CoRR, 2017.
- Lample, G., Ott, M., Conneau, A., Denoyer, L., and Ranzato, M. Phrase-based & neural unsupervised machine translation. In EMNLP, pp. 5039–5049, 2018.
- Le, Q. and Mikolov, T. Distributed representations of sentences and documents. In ICML, pp. 1188–1196, 2014.
- Logeswaran, L. and Lee, H. An efficient framework for learning sentence representations. CoRR, 2018.
- Mikolov, T., Sutskever, I., Chen, K., Corrado, G. S., and Dean, J. Distributed representations of words and phrases and their compositionality. In NIPS, pp. 3111–3119, 2013.
- Peters, M., Neumann, M., Iyyer, M., Gardner, M., Clark, C., Lee, K., and Zettlemoyer, L. Deep contextualized word representations. In NAACL, volume 1, pp. 2227–2237, 2018.
- Radford, A., Narasimhan, K., Salimans, T., and Sutskever, I. Improving language understanding by generative pre-training. 2018.
- Rajpurkar, P., Zhang, J., Lopyrev, K., and Liang, P. SQuAD: 100,000+ questions for machine comprehension of text. CoRR, 2016.
- Ramachandran, P., Liu, P. J., and Le, Q. V. Unsupervised pretraining for sequence to sequence learning. CoRR, abs/1611.02683, 2016.
- Sang, E. F. T. K. and De Meulder, F. Introduction to the CoNLL-2003 shared task: Language-independent named entity recognition. In NAACL, pp. 142–147. Association for Computational Linguistics, 2003.
- Sennrich, R., Haddow, B., and Birch, A. Neural machine translation of rare words with subword units. In ACL, volume 1, pp. 1715–1725, 2016.
- Shang, L., Lu, Z., and Li, H. Neural responding machine for short-text conversation. In ACL, volume 1, pp. 1577–1586, 2015.
- Shen, Y., Tan, X., He, D., Qin, T., and Liu, T.-Y. Dense information flow for neural machine translation. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), pp. 1294–1303, June 2018.
- Socher, R., Perelygin, A., Wu, J., Chuang, J., Manning, C. D., Ng, A., and Potts, C. Recursive deep models for semantic compositionality over a sentiment treebank. In EMNLP, pp. 1631–1642, 2013.
- Song, K., Tan, X., He, D., Lu, J., Qin, T., and Liu, T.-Y. Double path networks for sequence to sequence learning. In Proceedings of the 27th International Conference on Computational Linguistics, pp. 3064–3074, 2018.
- Suzuki, J. and Nagata, M. Cutting-off redundant repeating generations for neural abstractive summarization. In ACL, pp. 291–297, 2017.
- Szegedy, C., Liu, W., Jia, Y., Sermanet, P., Reed, S., Anguelov, D., Erhan, D., Vanhoucke, V., and Rabinovich, A. Going deeper with convolutions. In CVPR, pp. 1–9, 2015.
- Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, L., and Polosukhin, I. Attention is all you need. In NIPS, pp. 6000–6010, 2017.
- Vincent, P., Larochelle, H., Bengio, Y., and Manzagol, P.-A. Extracting and composing robust features with denoising autoencoders. In ICML, pp. 1096–1103. ACM, 2008.
- Vinyals, O. and Le, Q. V. A neural conversational model. CoRR, abs/1506.05869, 2015.
- Wang, A., Singh, A., Michael, J., Hill, F., Levy, O., and Bowman, S. R. GLUE: A multi-task benchmark and analysis platform for natural language understanding. CoRR, abs/1804.07461, 2018.
- Wu, Y., Schuster, M., Chen, Z., Le, Q. V., Norouzi, M., Macherey, W., Krikun, M., Cao, Y., Gao, Q., Macherey, K., Klingner, J., Shah, A., Johnson, M., Liu, X., Kaiser, L., Gouws, S., Kato, Y., Kudo, T., Kazawa, H., Stevens, K., Kurian, G., Patil, N., Wang, W., Young, C., Smith, J., Riesa, J., Rudnick, A., Vinyals, O., Corrado, G., Hughes, M., and Dean, J. Google's neural machine translation system: Bridging the gap between human and machine translation. CoRR, abs/1609.08144, 2016.
- Yang, Z., Chen, W., Wang, F., and Xu, B. Unsupervised neural machine translation with weight sharing. In ACL, pp. 46–55, 2018.
- Yuan, X., Wang, T., Gulcehre, C., Sordoni, A., Bachman, P., Subramanian, S., Zhang, S., and Trischler, A. Machine comprehension by text-to-text neural question generation. In Proceedings of the 2nd Workshop on Representation Learning for NLP, pp. 15–25, 2017.