Target Conditioned Sampling: Optimizing Data Selection for Multilingual Neural Machine Translation

Meeting of the Association for Computational Linguistics, 2019.

Keywords:
low-resource languages; Low Resource Languages for Emergent Incidents; Neural Machine Translation

Abstract:

To improve low-resource Neural Machine Translation (NMT) with multilingual corpora, training on the most related high-resource language only is often more effective than using all data available (Neubig and Hu, 2018). However, it is possible that an intelligent data selection strategy can further improve low-resource NMT with data from …

Introduction
  • Multilingual NMT has led to impressive gains in translation accuracy of low-resource languages (LRL) (Zoph et al., 2016; Firat et al., 2016; Gu et al., 2018; Neubig and Hu, 2018; Nguyen and Chiang, 2018).
  • Examples of such multilingual corpora include TED (Qi et al., 2018), Europarl (Koehn, 2005), and many others (Tiedemann, 2012).
  • These datasets open up the tantalizing prospect of training a system on many different languages to improve accuracy, but previous work has found that methods using only a single related high-resource language (HRL) often outperform systems trained on all available data (Neubig and Hu, 2018).
  • The authors go a step further and ask: can they design an intelligent data selection strategy that chooses the most relevant multilingual data to further boost NMT performance and training speed for LRLs?
Highlights
  • Multilingual Neural Machine Translation has led to impressive gains in translation accuracy of low-resource languages (LRL) (Zoph et al., 2016; Firat et al., 2016; Gu et al., 2018; Neubig and Hu, 2018; Nguyen and Chiang, 2018)
  • We propose and experiment with several design choices for Target Conditioned Sampling, which are especially effective for low-resource languages
  • We propose two strategies for implementation: 1) Stochastic (TCS-S): compute Q(X|y) before training starts, then dynamically sample each minibatch from the precomputed Q(X|y); 2) Deterministic (TCS-D): compute Q(X|y) before training starts and always select x = argmax_x Q(x|y) for training (a sketch of both follows this list)
  • Copied is only competitive for slk, which indicates the gain of Target Conditioned Sampling is not due to extra English data
  • We propose Target Conditioned Sampling (TCS), an efficient data selection framework for multilingual data by constructing a data sampling distribution that facilitates the Neural Machine Translation training of low-resource languages
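
    As a concrete illustration of the two strategies, the following is a minimal Python sketch. It assumes Q(X|y) has already been precomputed as a mapping from each target sentence y to candidate source sentences with their probabilities; the function and variable names are illustrative and are not taken from the authors' implementation.

        import random

        # q_x_given_y: dict mapping a target sentence y to a list of
        # (source_sentence, probability) pairs, precomputed before training.

        def tcs_stochastic(y, q_x_given_y):
            """TCS-S: resample a source sentence x ~ Q(X|y) for every minibatch."""
            candidates, probs = zip(*q_x_given_y[y])
            return random.choices(candidates, weights=probs, k=1)[0]

        def tcs_deterministic(y, q_x_given_y):
            """TCS-D: always train on the fixed choice x = argmax_x Q(x|y)."""
            return max(q_x_given_y[y], key=lambda pair: pair[1])[0]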
Methods
  • 2.1 Multilingual Training Objective

    First, the authors introduce the problem formally, using the upper-case letters X, Y to denote random variables and the corresponding lower-case letters x, y to denote their actual values.
  • Let x be a source sentence from s and y be the equivalent target sentence from t. Given a loss function L(x, y; θ), the objective is to find the optimal parameters θ∗ that minimize E_{x,y ∼ P_S(X,Y)}[L(x, y; θ)] (Eq. 1).
  • The authors want to construct a distribution Q(X, Y) with support over s_1-t, s_2-t, ..., s_n-t to augment the s-t data with samples from Q during training (see the sketch after this list).
  • (Fragment of Table 2: BLEU under the Vocab-lang similarity measure, TCS-D vs. TCS-S: 10.68, 11.09†, 10.58, 11.46∗.)
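
    Putting the pieces together, the objective in Eq. 1 and the target-first factorization of the sampling distribution can be written as below. This is a hedged reconstruction from the surrounding definitions, not a verbatim transcription of the paper's equations; sim(·) stands for the chosen similarity measure (Section 2.5) and may operate at the language or the sentence level.

        % Eq. 1: training objective on the s-t pair; Q is factored target-first,
        % with the conditional weighted by similarity to the LRL source side s.
        \theta^{*} = \arg\min_{\theta}\; \mathbb{E}_{x,y \sim P_S(X,Y)}\big[\mathcal{L}(x, y; \theta)\big]
        \qquad
        Q(x, y) = Q(y)\, Q(x \mid y), \quad Q(x \mid y) \propto \mathrm{sim}(x, s)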
Results
  • The authors test both the Deterministic (TCS-D) and Stochastic (TCS-S) algorithms described in Section 2.4.
  • The authors experiment with the similarity measures introduced in Section 2.5 (an illustrative sketch of one such measure appears after this list).
  • Bi generally has the best performance among the baselines, while All, which uses all available data and takes much longer to train, generally hurts performance.
  • This is consistent with findings in prior work (Neubig and Hu, 2018).
  • Copied is only competitive for slk, which indicates the gain of TCS is not due to extra English data
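
    For intuition, the following is a minimal sketch of a language-level vocabulary-overlap similarity, in the spirit of the Vocab-lang setting above. The exact definition and normalization used in the paper may differ, and the function name is illustrative only.

        def vocab_overlap_similarity(vocab_lrl, vocab_aux, top_k=None):
            """Fraction of the LRL's (sub)word vocabulary that also appears in an
            auxiliary language's vocabulary. Optionally restrict both sides to
            their top_k most frequent entries (assumes frequency-sorted lists)."""
            lrl = set(vocab_lrl if top_k is None else vocab_lrl[:top_k])
            aux = set(vocab_aux if top_k is None else vocab_aux[:top_k])
            return len(lrl & aux) / max(len(lrl), 1)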
Conclusion
  • The authors propose Target Conditioned Sampling (TCS), an efficient data selection framework for multilingual data that constructs a data sampling distribution to facilitate NMT training of LRLs. TCS brings up to 2 BLEU points of improvement over strong baselines with only a slight increase in training time
Tables
  • Table 1: Statistics of our datasets
  • Table 2: BLEU scores on four languages. Statistical significance (Clark et al., 2011) is indicated with ∗ (p < 0.001) and † (p < 0.05), compared with the best baseline
  • Table 3: BLEU scores using SDE as word encoding. Statistical significance is indicated with ∗ (p < 0.001) and † (p < 0.05), compared with the best baseline
Funding
  • This material is based upon work supported in part by the Defense Advanced Research Projects Agency Information Innovation Office (I2O) Low Resource Languages for Emergent Incidents (LORELEI) program under Contract No. HR0011-15-C0114.
Reference
  • Amittai Axelrod, Xiaodong He, and Jianfeng Gao. 2011. Domain adaptation via pseudo in-domain data selection. In EMNLP.
  • Boxing Chen, Colin Cherry, George Foster, and Samuel Larkin. 2017. Cost weighting for neural machine translation domain adaptation. In WMT.
  • Jonathan Clark, Chris Dyer, Alon Lavie, and Noah Smith. 2011. Better hypothesis testing for statistical machine translation: Controlling for optimizer instability. In ACL.
  • Anna Currey, Antonio Valerio Miceli Barone, and Kenneth Heafield. 2017. Copied monolingual data improves low-resource neural machine translation. In WMT.
  • Kevin Duh, Graham Neubig, Katsuhito Sudoh, and Hajime Tsukada. 2013. Adaptation data selection using neural language models: Experiments in machine translation. In ACL.
  • Orhan Firat, Kyunghyun Cho, and Yoshua Bengio. 2016. Multi-way, multilingual neural machine translation with a shared attention mechanism. In NAACL.
  • Jiatao Gu, Hany Hassan, Jacob Devlin, and Victor O. K. Li. 2018. Universal neural machine translation for extremely low resource languages. In NAACL.
  • Philipp Koehn. 2005. Europarl: A parallel corpus for statistical machine translation. In MT Summit.
  • Taku Kudo and John Richardson. 2018. SentencePiece: A simple and language independent subword tokenizer and detokenizer for neural text processing. In EMNLP.
  • Robert C. Moore and William D. Lewis. 2010. Intelligent selection of language model training data. In ACL.
  • Graham Neubig and Junjie Hu. 2018. Rapid adaptation of neural machine translation to new languages. In EMNLP.
  • Toan Q. Nguyen and David Chiang. 2018. Transfer learning across low-resource, related languages for neural machine translation. In NAACL.
  • Ye Qi, Devendra Singh Sachan, Matthieu Felix, Sarguna Padmanabhan, and Graham Neubig. 2018. When and why are pre-trained word embeddings useful for neural machine translation? In NAACL.
  • Rico Sennrich, Barry Haddow, and Alexandra Birch. 2016. Neural machine translation of rare words with subword units. In ACL.
  • Ilya Sutskever, Oriol Vinyals, and Quoc V. Le. 2014. Sequence to sequence learning with neural networks. In NIPS.
  • Jörg Tiedemann. 2012. Parallel data, tools and interfaces in OPUS. In LREC.
  • Rui Wang, Andrew Finch, Masao Utiyama, and Eiichiro Sumita. 2017. Sentence embedding for neural machine translation domain adaptation. In ACL.
  • Xinyi Wang, Hieu Pham, Philip Arthur, and Graham Neubig. 2019. Multilingual neural machine translation with soft decoupled encoding. In ICLR.
  • Xinyi Wang, Hieu Pham, Zihang Dai, and Graham Neubig. 2018. SwitchOut: An efficient data augmentation algorithm for neural machine translation. In EMNLP.
  • Marlies van der Wees, Arianna Bisazza, and Christof Monz. 2017. Dynamic data selection for neural machine translation. In EMNLP.
  • Barret Zoph, Deniz Yuret, Jonathan May, and Kevin Knight. 2016. Transfer learning for low resource neural machine translation. In EMNLP.
  • We slightly modify the LM code from https://github.com/zihangdai/mos for our experiments.