SwitchOut: an Efficient Data Augmentation Algorithm for Neural Machine Translation
EMNLP, 2018.
Keywords:
Low Resource Languages for Emergent Incidents, augmentation scheme, different scale, Reward Augmented Maximum Likelihood, IWSLT
Abstract:
In this work, we examine methods for data augmentation for text-based tasks such as neural machine translation (NMT). We formulate the design of a data augmentation policy with desirable properties as an optimization problem, and derive a generic analytic solution. This solution not only subsumes some existing augmentation schemes, but also…
Introduction
- Introduction and Related Work
Data augmentation algorithms generate extra data points from the empirically observed training set to train subsequent machine learning algorithms.
- Fadaee et al. (2017) propose to replace words in the target sentences with rare words from the target vocabulary according to a language model, and to modify the aligned source words accordingly.
- While this method generates augmented data with relatively high quality, it requires several complicated preprocessing steps, and is only shown to be effective for low-resource datasets.
- Other generic word-replacement methods include word dropout (Sennrich et al., 2016a; Gal and Ghahramani, 2016), which uniformly sets some word embeddings to zero at random, and Reward Augmented Maximum Likelihood (RAML; Norouzi et al., 2016), whose implementation essentially replaces some words in the target sentences with other words from the target vocabulary (a minimal sketch of word dropout follows this list).
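As a rough illustration of the word dropout scheme mentioned above, the following minimal sketch zeroes whole word embeddings at random; the function name, the dropout rate p, and the NumPy representation are illustrative assumptions, not details taken from the cited papers.

import numpy as np

def word_dropout(embeddings, p=0.1, rng=None):
    """Illustrative word dropout: independently zero out whole word
    embeddings with probability p (one decision per word).
    embeddings: array of shape (sentence_length, embedding_dim)."""
    rng = rng or np.random.default_rng()
    keep = rng.random(embeddings.shape[0]) >= p   # True where a word is kept
    return embeddings * keep[:, None]             # dropped words become zero vectors

Replacing the zeroing step with a draw from the vocabulary turns the same skeleton into the kind of word-replacement scheme used by RAML on the target side.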
Highlights
- Introduction and Related Work
Data augmentation algorithms generate extra data points from the empirically observed training set to train subsequent machine learning algorithms.
- Data augmentation techniques for neural machine translation fall into two categories.
- Other generic word-replacement methods include word dropout (Sennrich et al., 2016a; Gal and Ghahramani, 2016), which uniformly sets some word embeddings to zero at random, and Reward Augmented Maximum Likelihood (RAML; Norouzi et al., 2016), whose implementation essentially replaces some words in the target sentences with other words from the target vocabulary.
- We report the BLEU scores of SwitchOut, word dropout, and Reward Augmented Maximum Likelihood on the test sets of the tasks in Table 1
- SwitchOut on the source side yields gains as large as those obtained by Reward Augmented Maximum Likelihood on the target side, and SwitchOut delivers further improvements when combined with Reward Augmented Maximum Likelihood.
- We propose a method to design data augmentation algorithms by solving an optimization problem (a sketch of such an objective follows below).
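This page does not reproduce the derivation. As a hedged sketch, a maximum-entropy objective of the following general shape, with s a similarity score to the observed sentence pair, tau a temperature, and H the Shannon entropy, admits an exponentiated-similarity sampling distribution; this is the kind of analytic solution the highlight refers to, and the exact objective should be taken from the paper.

\max_{q}\; \mathbb{E}_{(\hat{x},\hat{y})\sim q}\big[s(\hat{x},\hat{y})\big] \;+\; \tau\, \mathcal{H}(q)
\quad\Longrightarrow\quad
q^{*}(\hat{x},\hat{y}) \;\propto\; \exp\!\big(s(\hat{x},\hat{y})/\tau\big)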
Methods
- 2.1 Notations
The authors use uppercase letters, such as X, Y, etc., to denote random variables, and lowercase letters, such as x, y, etc., to denote the corresponding actual values.
- Since the authors discuss a data augmentation algorithm, they use a hat to denote augmented variables and their values, e.g. X̂ and x̂ (a rough sketch of a SwitchOut-style augmentation step in this notation follows this list).
- The authors report the BLEU scores of SwitchOut, word dropout, and RAML on the test sets of the tasks in Table 1.
- The gains in BLEU with SwitchOut over the best baseline on WMT 15 en-de are all significant (p < 0.0002).
- SwitchOut on the source side yields gains as large as those obtained by RAML on the target side, and SwitchOut delivers further improvements when combined with RAML.
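The following is a minimal, non-authoritative sketch of a SwitchOut-style augmentation step on a single tokenized sentence, assuming the recipe stated in the abstract (replace randomly chosen words with other words from the corresponding vocabulary) and a tempered distribution over how many positions to replace; the distribution, the temperature tau, and all names are illustrative assumptions, and the paper's exact sampling scheme may differ.

import numpy as np

def switchout_style(sentence, vocab_size, tau=1.0, rng=None):
    """Hypothetical SwitchOut-style corruption of a list of token ids x,
    producing an augmented x_hat in the hatted notation above."""
    rng = rng or np.random.default_rng()
    length = len(sentence)
    # Sample how many positions to replace: p(n) proportional to exp(-n / tau).
    logits = -np.arange(length + 1) / tau
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()
    n = int(rng.choice(length + 1, p=probs))
    out = list(sentence)
    if n:
        # Replace n uniformly chosen positions with a different vocabulary item.
        for pos in rng.choice(length, size=n, replace=False):
            new_tok = int(rng.integers(vocab_size - 1))
            out[pos] = new_tok if new_tok < out[pos] else new_tok + 1  # skip the original id
    return out

Per the highlights above, such a corruption can be applied to the source sentence, to the target sentence, or to both.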
Results
- The gains in BLEU with SwitchOut over the best baseline on WMT 15 en-de are all significant (p < 0.0002).
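The page does not state which significance test produced p < 0.0002; since Clark et al. (2011) appears in the reference list, a resampling test over test sentences is plausible. The sketch below shows paired bootstrap resampling, assuming a user-supplied corpus-level bleu(hypotheses, references) callable; it illustrates the idea and is not the paper's exact procedure.

import random

def paired_bootstrap(bleu, sys_a, sys_b, refs, n_samples=1000, seed=0):
    """Approximate p-value for "system A is not better than system B":
    resample the test set with replacement and count how often A wins."""
    rng = random.Random(seed)
    indices = list(range(len(refs)))
    wins = 0
    for _ in range(n_samples):
        sample = [rng.choice(indices) for _ in indices]   # resample with replacement
        a = bleu([sys_a[i] for i in sample], [refs[i] for i in sample])
        b = bleu([sys_b[i] for i in sample], [refs[i] for i in sample])
        wins += a > b
    return 1.0 - wins / n_samples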
Conclusion
- The authors propose a method to design data augmentation algorithms by solving an optimization problem.
- Because SwitchOut expands the support of the training distribution, the authors would expect it to help the most on test sentences that are far from those in the training set, which would benefit most from this expanded support.
- To test this hypothesis, for each test sentence the authors find its most similar training sample and bucket the test instances by the distance to their most similar training samples (an illustrative sketch of this analysis follows below).
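The bucketing analysis described in the last bullet can be sketched roughly as follows; token-level edit distance is used purely as an illustrative stand-in for whatever distance the paper uses, and the bucket width is an arbitrary assumption.

from collections import defaultdict

def edit_distance(a, b):
    """Token-level Levenshtein distance between two tokenized sentences."""
    prev = list(range(len(b) + 1))
    for i, tok_a in enumerate(a, 1):
        cur = [i]
        for j, tok_b in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (tok_a != tok_b)))
        prev = cur
    return prev[-1]

def bucket_by_train_distance(test_sents, train_sents, bucket_width=10):
    """Group each test sentence by the distance to its most similar training
    sentence, so that per-bucket scores can then be compared across systems."""
    buckets = defaultdict(list)
    for sent in test_sents:
        d = min(edit_distance(sent, t) for t in train_sents)
        buckets[d // bucket_width].append(sent)
    return buckets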
Tables
- Table 1: Test BLEU scores of SwitchOut and other baselines (median of multiple runs). Results marked with † are statistically significant compared to the best result without SwitchOut. For example, for the en-de results in the first column, +SwitchOut has a significant gain over Transformer, and +RAML +SwitchOut has a significant gain over +RAML.
- Table 2: Test BLEU scores of back-translation (BT) compared to and combined with SwitchOut (median of 4 runs).
Funding
- We thank Quoc Le, Minh-Thang Luong, Qizhe Xie, and the anonymous EMNLP reviewers, for their suggestions to improve the paper. This material is based upon work supported in part by the Defense Advanced Research Projects Agency Information Innovation Office (I2O) Low Resource Languages for Emergent Incidents (LORELEI) program under Contract No HR0011-15-C0114
Reference
- Dario Amodei, Sundaram Ananthanarayanan, Rishita Anubhai, et al. 2016. Deep Speech 2: End-to-end speech recognition in English and Mandarin. In ICML.
- Yonatan Belinkov and Yonatan Bisk. 2018. Synthetic and natural noise both break neural machine translation. In ICLR.
- Adam L Berger, Vincent J Della Pietra, and Stephen A Della Pietra. 1996. A maximum entropy approach to natural language processing. Computational linguistics, 22(1):39–71.
- Olivier Chapelle, Bernhard Schölkopf, and Alexander Zien. 2006. Semi-Supervised Learning. MIT Press.
- Jonathan Clark, Chris Dyer, Alon Lavie, and Noah Smith. 2011. Better hypothesis testing for statistical machine translation: Controlling for optimizer instability. In ACL.
- Terrance DeVries and Graham W. Taylor. 2017. Improved regularization of convolutional neural networks with cutout. Arxiv, 1708.04552.
- Marzieh Fadaee, Arianna Bisazza, and Christof Monz. 2017. Data augmentation for low-resource neural machine translation. In ACL.
- Yarin Gal and Zoubin Ghahramani. 2016. A theoretically grounded application of dropout in recurrent neural networks. In NIPS.
- Jiatao Gu, James Bradbury, Caiming Xiong, Victor O.K. Li, and Richard Socher. 2018. Nonautoregressive neural machine translation. In ICLR.
- Gao Huang, Zhuang Liu, Laurens van der Maaten, and Kilian Q. Weinberger. 2016. Densely connected convolutional networks. In CVPR.
- Diederik P. Kingma and Jimmy Lei Ba. 2015. Adam: A method for stochastic optimization. In ICLR.
- Alex Krizhevsky, Ilya Sutskever, and Geoffrey E. Hinton. 2012. ImageNet classification with deep convolutional neural networks. In NIPS.
- Minh-Thang Luong and Christopher D. Manning. 2015. Stanford neural machine translation systems for spoken language domain. In IWSLT.
- Minh-Thang Luong, Hieu Pham, and Christopher D. Manning. 2015. Effective approaches to attention-based neural machine translation. In EMNLP.
- Xuezhe Ma, Pengcheng Yin, Jingzhou Liu, Graham Neubig, and Eduard Hovy. 2017. Softmax q-distribution estimation for structured prediction: A theoretical interpretation for raml. Arxiv, 1705.07136.
- Mohammad Norouzi, Samy Bengio, Zhifeng Chen, Navdeep Jaitly, Mike Schuster, Yonghui Wu, and Dale Schuurmans. 2016. Reward augmented maximum likelihood for neural structured prediction. In NIPS.
- Alberto Poncelas, Dimitar Shterionov, Andy Way, Gideon Maillette de Buy Wenniger, and Peyman Passban. 2018. Investigating backtranslation in neural machine translation. Arxiv, 1804.06189.
- Marc’Aurelio Ranzato, Sumit Chopra, Michael Auli, and Wojciech Zaremba. 2016. Sequence level training with recurrent neural networks. In ICLR.
- Rico Sennrich, Barry Haddow, and Alexandra Birch. 2016a. Edinburgh neural machine translation systems for wmt 16. In WMT.
- Rico Sennrich, Barry Haddow, and Alexandra Birch. 2016b. Improving neural machine translation models with monolingual data. In ACL.
- Christian Szegedy, Wei Liu, Yangqing Jia, Pierre Sermanet, Scott Reed, Dragomir Anguelov, Dumitru Erhan, Vincent Vanhoucke, and Andrew Rabinovich. 2014. Going deeper with convolutions. In CVPR.
- Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In NIPS.
- Sergey Zagoruyko and Nikos Komodakis. 2016. Wide residual networks. In BMVC.