SwitchOut: an Efficient Data Augmentation Algorithm for Neural Machine Translation

EMNLP, 2018.

Keywords:
Low Resource Languages for Emergent Incidents, augmentation scheme, different scale, Reward Augmented Maximum Likelihood, IWSLT

Abstract:

In this work, we examine methods for data augmentation for text-based tasks such as neural machine translation (NMT). We formulate the design of a data augmentation policy with desirable properties as an optimization problem, and derive a generic analytic solution. This solution not only subsumes some existing augmentation schemes, but also...

Introduction
  • Introduction and Related Work

    Data augmentation algorithms generate extra data points from the empirically observed training set to train subsequent machine learning algorithms.
  • Fadaee et al. (2017) propose to replace words in the target sentences with rare words in the target vocabulary according to a language model, and to modify the aligned source words.
  • While this method generates augmented data with relatively high quality, it requires several complicated preprocessing steps, and is only shown to be effective for low-resource datasets.
  • Other generic word replacement methods include word dropout (Sennrich et al., 2016a; Gal and Ghahramani, 2016), which uniformly sets some word embeddings to 0 at random, and Reward Augmented Maximum Likelihood (RAML; Norouzi et al., 2016), whose implementation essentially replaces some words in the target sentences with other words from the target vocabulary.
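Of the two generic schemes mentioned in the last bullet, word dropout is the simplest to illustrate. The sketch below zeroes whole word embeddings independently at random, as described above; the function name, the default rate of 0.1, and the use of NumPy are illustrative choices rather than details taken from any of the cited papers.

```python
import numpy as np

def word_dropout(embeddings, p=0.1, rng=None):
    """Zero out entire word embeddings uniformly at random.

    embeddings: array of shape (seq_len, emb_dim), one row per token.
    p: probability that each token's embedding is dropped (set to 0).
    """
    rng = rng or np.random.default_rng()
    # One keep/drop decision per token, broadcast over the embedding
    # dimension so a whole row is either kept or zeroed.
    keep = (rng.random(embeddings.shape[0]) >= p).astype(embeddings.dtype)
    return embeddings * keep[:, None]
```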
Highlights
  • Introduction and Related Work

    Data augmentation algorithms generate extra data points from the empirically observed training set to train subsequent machine learning algorithms
  • Data augmentation techniques for neural machine translation fall into two categories
  • Other generic word replacement methods include word dropout (Sennrich et al., 2016a; Gal and Ghahramani, 2016), which uniformly sets some word embeddings to 0 at random, and Reward Augmented Maximum Likelihood (RAML; Norouzi et al., 2016), whose implementation essentially replaces some words in the target sentences with other words from the target vocabulary.
  • We report the BLEU scores of SwitchOut, word dropout, and Reward Augmented Maximum Likelihood on the test sets of the tasks in Table 1
  • SwitchOut on the source side yields gains as large as those obtained by Reward Augmented Maximum Likelihood (RAML) on the target side, and SwitchOut delivers further improvements when combined with RAML.
  • We propose a method to design data augmentation algorithms by solving an optimization problem
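The last two highlights describe word replacement applied on the source side (SwitchOut) and on the target side (RAML), as well as the two combined. A minimal sketch of that combination, assuming uniform sampling from the vocabulary and a fixed per-token replacement probability, is shown below; the paper's actual sampling distribution is not reproduced here, so treat the helper names and probabilities as placeholders.

```python
import random

def replace_words(tokens, vocab, p=0.1, rng=random):
    """Replace each token with a uniformly sampled vocabulary word with probability p."""
    return [rng.choice(vocab) if rng.random() < p else tok for tok in tokens]

def augment_pair(src, tgt, src_vocab, tgt_vocab, p_src=0.1, p_tgt=0.1):
    """Apply replacement independently to the source (SwitchOut-style)
    and the target (RAML-style) sides of a sentence pair."""
    return replace_words(src, src_vocab, p_src), replace_words(tgt, tgt_vocab, p_tgt)

# Example (hypothetical vocabularies):
# aug_src, aug_tgt = augment_pair("ein kleines haus".split(),
#                                 "a small house".split(), de_vocab, en_vocab)
```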
Methods
  • 2.1 Notations

    The authors use uppercase letters, such as X, Y, etc., to denote random variables, and lowercase letters, such as x, y, etc., to denote the corresponding actual values.
  • Since the authors will discuss a data augmentation algorithm, they will use a hat to denote augmented variables and their values, e.g., X̂ and x̂ (a reconstruction of the optimization formulation built on this notation appears after this list).
  • The authors report the BLEU scores of SwitchOut, word dropout, and RAML on the test sets of the tasks in Table 1.
  • The gains in BLEU with SwitchOut over the best baseline on WMT 15 en-de are all significant (p < 0.0002).
  • SwitchOut on the source side yields gains as large as those obtained by RAML on the target side, and SwitchOut delivers further improvements when combined with RAML.
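The abstract states that the augmentation policy is derived as the analytic solution of an optimization problem. Using the notation above (hats mark augmented values), a plausible reconstruction is an entropy-regularized objective over the augmentation distribution q, whose maximizer is an exponentiated similarity; the similarity function s and temperature τ here are assumed for illustration and are not quoted from the paper.

```latex
% Illustrative reconstruction, not the paper's exact statement:
% s(\hat{x}, \hat{y}; x, y) scores how well an augmented pair preserves
% the observed pair, and \tau trades diversity against that fidelity.
\max_{q} \; \mathbb{E}_{(\hat{X}, \hat{Y}) \sim q}\big[ s(\hat{X}, \hat{Y}; x, y) \big]
  + \tau \, \mathcal{H}(q)
\quad \Longrightarrow \quad
q^{*}(\hat{x}, \hat{y} \mid x, y) \;\propto\; \exp\!\Big( \tfrac{s(\hat{x}, \hat{y}; x, y)}{\tau} \Big)
```

Under this reading, the word-replacement schemes above would correspond to particular choices of the similarity s, which is consistent with the abstract's claim that the solution subsumes some existing augmentation schemes.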
Results
  • The gains in BLEU with SwitchOut over the best baseline on WMT 15 en-de are all significant (p < 0.0002).
Conclusion
  • The authors propose a method to design data augmentation algorithms by solving an optimization problem.
  • Because SwitchOut expands the support of the training distribution, the authors would expect it to help most on test sentences that are far from the training set and therefore benefit most from this expanded support.
  • To test this hypothesis, for each test sentence the authors find its most similar training sample, bucket the test sentences by the distance to their nearest training sample, and compare the BLEU gains of SwitchOut across buckets (a sketch of this analysis follows).
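A sketch of that analysis is given below. Each test sentence is matched to its nearest training sentence under a simple token-overlap distance (a stand-in, since this summary does not specify the similarity measure used), the test set is split into buckets by that distance, and the per-bucket BLEU gain of the SwitchOut model over the baseline would then be compared.

```python
from collections import defaultdict

def overlap_distance(a, b):
    """1 - Jaccard overlap of two token lists (illustrative distance only)."""
    sa, sb = set(a), set(b)
    return 1.0 - len(sa & sb) / max(len(sa | sb), 1)

def bucket_by_train_distance(test_sents, train_sents, n_buckets=5):
    """Group test-sentence indices by distance to their nearest training sentence.

    Brute-force nearest-neighbour search; fine for a sketch, slow at scale.
    """
    buckets = defaultdict(list)
    for i, sent in enumerate(test_sents):
        d = min(overlap_distance(sent, t) for t in train_sents)
        buckets[min(int(d * n_buckets), n_buckets - 1)].append(i)
    return buckets

# For each bucket one would then compare BLEU of the baseline and the
# SwitchOut-trained model on just those sentences, expecting the largest
# gains in the buckets farthest from the training data.
```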
Tables
  • Table 1: Test BLEU scores of SwitchOut and other baselines (median of multiple runs). Results marked with † are statistically significant compared to the best result without SwitchOut. For example, for the en-de results in the first column, +SwitchOut has a significant gain over Transformer; +RAML +SwitchOut has a significant gain over +RAML.
  • Table 2: Test BLEU scores of back translation (BT) compared to and combined with SwitchOut (median of 4 runs).
Funding
  • We thank Quoc Le, Minh-Thang Luong, Qizhe Xie, and the anonymous EMNLP reviewers for their suggestions to improve the paper. This material is based upon work supported in part by the Defense Advanced Research Projects Agency Information Innovation Office (I2O) Low Resource Languages for Emergent Incidents (LORELEI) program under Contract No. HR0011-15-C0114.
Reference
  • Dario Amodei, Sundaram Ananthanarayanan, Rishita Anubhai, et al. 2016. Deep Speech 2: End-to-end speech recognition in English and Mandarin. In ICML.
  • Yonatan Belinkov and Yonatan Bisk. 2018. Synthetic and natural noise both break neural machine translation. In ICLR.
  • Adam L. Berger, Vincent J. Della Pietra, and Stephen A. Della Pietra. 1996. A maximum entropy approach to natural language processing. Computational Linguistics, 22(1):39–71.
  • Olivier Chapelle, Bernhard Schölkopf, and Alexander Zien, editors. 2006. Semi-Supervised Learning. MIT Press.
  • Jonathan Clark, Chris Dyer, Alon Lavie, and Noah Smith. 2011. Better hypothesis testing for statistical machine translation: Controlling for optimizer instability. In ACL.
  • Terrance DeVries and Graham W. Taylor. 2017. Improved regularization of convolutional neural networks with Cutout. arXiv, 1708.04552.
  • Marzieh Fadaee, Arianna Bisazza, and Christof Monz. 2017. Data augmentation for low-resource neural machine translation. In ACL.
  • Yarin Gal and Zoubin Ghahramani. 2016. A theoretically grounded application of dropout in recurrent neural networks. In NIPS.
  • Jiatao Gu, James Bradbury, Caiming Xiong, Victor O.K. Li, and Richard Socher. 2018. Non-autoregressive neural machine translation. In ICLR.
  • Gao Huang, Zhuang Liu, Laurens van der Maaten, and Kilian Q. Weinberger. 2016. Densely connected convolutional networks. In CVPR.
  • Diederik P. Kingma and Jimmy Lei Ba. 2015. Adam: A method for stochastic optimization. In ICLR.
  • Alex Krizhevsky, Ilya Sutskever, and Geoffrey E. Hinton. 2012. ImageNet classification with deep convolutional neural networks. In NIPS.
  • Minh-Thang Luong and Christopher D. Manning. 2015. Stanford neural machine translation systems for spoken language domains. In IWSLT.
  • Minh-Thang Luong, Hieu Pham, and Christopher D. Manning. 2015. Effective approaches to attention-based neural machine translation. In EMNLP.
  • Xuezhe Ma, Pengcheng Yin, Jingzhou Liu, Graham Neubig, and Eduard Hovy. 2017. Softmax Q-distribution estimation for structured prediction: A theoretical interpretation for RAML. arXiv, 1705.07136.
  • Mohammad Norouzi, Samy Bengio, Zhifeng Chen, Navdeep Jaitly, Mike Schuster, Yonghui Wu, and Dale Schuurmans. 2016. Reward augmented maximum likelihood for neural structured prediction. In NIPS.
  • Alberto Poncelas, Dimitar Shterionov, Andy Way, Gideon Maillette de Buy Wenniger, and Peyman Passban. 2018. Investigating backtranslation in neural machine translation. arXiv, 1804.06189.
  • Marc'Aurelio Ranzato, Sumit Chopra, Michael Auli, and Wojciech Zaremba. 2016. Sequence level training with recurrent neural networks. In ICLR.
  • Rico Sennrich, Barry Haddow, and Alexandra Birch. 2016a. Edinburgh neural machine translation systems for WMT 16. In WMT.
  • Rico Sennrich, Barry Haddow, and Alexandra Birch. 2016b. Improving neural machine translation models with monolingual data. In ACL.
  • Christian Szegedy, Wei Liu, Yangqing Jia, Pierre Sermanet, Scott Reed, Dragomir Anguelov, Dumitru Erhan, Vincent Vanhoucke, and Andrew Rabinovich. 2014. Going deeper with convolutions. In CVPR.
  • Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In NIPS.
  • Sergey Zagoruyko and Nikos Komodakis. 2016. Wide residual networks. In BMVC.