Balancing Training for Multilingual Neural Machine Translation

ACL, pp. 8526-8537, 2020.

Keywords:
entity recognition, multilingual model, low-resource neural machine translation, low-resource languages, high-resource languages
Weibo:
We extend and improve over previous work on DDS, with a more efficient algorithmic instantiation tailored for the multilingual training problem and a stable reward to optimize multiple objectives

Abstract:

When training multilingual machine translation (MT) models that can translate to/from multiple languages, we are faced with imbalanced training sets: some languages have much more training data than others. Standard practice is to up-sample less resourced languages to increase representation, and the degree of up-sampling has a large effect [...]
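The "standard practice" the abstract refers to is commonly implemented as temperature-based sampling, where each language is drawn with probability proportional to its dataset size raised to 1/τ. The sketch below is illustrative only (the dataset sizes and the function name are invented, not from the paper); it shows how the temperature interpolates between proportional and uniform sampling.

```python
# Illustrative sketch (not from the paper) of the heuristic baseline the
# abstract refers to: temperature-based sampling, where language i is drawn
# with probability proportional to its dataset size raised to 1/tau.
# tau = 1 recovers proportional sampling; large tau approaches uniform.

def temperature_sampling_probs(sizes, tau=1.0):
    """Return per-language sampling probabilities for the given dataset sizes."""
    scaled = [s ** (1.0 / tau) for s in sizes]
    total = sum(scaled)
    return [s / total for s in scaled]

# Example: one high-resource and two low-resource languages (sizes are made up).
sizes = [1_000_000, 50_000, 10_000]
print(temperature_sampling_probs(sizes, tau=1.0))    # ~[0.943, 0.047, 0.009]: proportional
print(temperature_sampling_probs(sizes, tau=5.0))    # ~[0.513, 0.282, 0.204]: up-samples LRLs
print(temperature_sampling_probs(sizes, tau=100.0))  # close to uniform
```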
Introduction
Highlights
  • Multilingual models are trained to process different languages in a single model, and have been applied to a wide variety of NLP tasks such as text classification (Klementiev et al, 2012; Chen et al, 2018a), syntactic analysis (Plank et al, 2016; Ammar et al, 2016), named-entity recognition (Xie et al, 2018; Wu and Dredze, 2019), and machine translation (MT) (Dong et al, 2015; Johnson et al, 2016)
  • A common problem with multilingual training is that the data from different languages are both heterogeneous and imbalanced
  • We ask the question: “is it possible to learn an optimal strategy to automatically balance the usage of data in multilingual model training?” To this effect, we propose a method that learns a language scorer that can be used throughout training to improve the model performance on all languages
  • We propose MultiDDS, an algorithm that learns a language scorer to optimize multilingual data usage to achieve good performance on many different languages
  • We extend and improve over previous work on DDS (Wang et al, 2019b), with a more efficient algorithmic instantiation tailored for the multilingual training problem and a stable reward to optimize multiple objectives
  • MultiDDS is not limited to NMT, and future work may consider applications to other multilingual tasks
Methods
  • MultiDDS directly parameterizes the standard dataset sampling distribution for multilingual training with ψ: $P_D(i; \psi) = e^{\psi_i} / \sum_{k=1}^{n} e^{\psi_k}$ (Eq. 8); a minimal sketch of this sampler appears after this list.
  • Unlike standard DDS, the authors make the design decision to weight training datasets rather than score each training example (x, y) directly, as this is more efficient and likely easier to learn.
  • The gains for the Related group are larger than for the Diverse group, likely because MultiDDS can take better advantage of language similarities than the baseline methods.
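The following is a minimal, self-contained sketch of this dataset-level scorer: logits ψ define the sampling distribution of Eq. (8), a training language is drawn from it, and ψ is nudged by a REINFORCE-style update (Williams, 1992). The reward used here (a fixed stand-in signal) and all function names are illustrative assumptions, not the paper's exact step-ahead reward or algorithm.

```python
import numpy as np

# Minimal sketch (assumptions flagged in comments) of a dataset-level scorer as
# in Eq. (8): each training language i gets a logit psi_i, and a dataset index
# is sampled with probability P_D(i; psi) = softmax(psi)_i. The scorer is
# updated with a REINFORCE-style gradient (Williams, 1992); the constant reward
# below is a stand-in, not the paper's step-ahead reward.

rng = np.random.default_rng(0)

def softmax(psi):
    z = np.exp(psi - psi.max())
    return z / z.sum()

def sample_language(psi):
    """Draw a training-language index i ~ P_D(i; psi)."""
    return rng.choice(len(psi), p=softmax(psi))

def reinforce_update(psi, i, reward, lr=0.1):
    """One REINFORCE step on the sampled index i:
    grad log P_D(i; psi) = one_hot(i) - softmax(psi)."""
    grad_log_p = -softmax(psi)
    grad_log_p[i] += 1.0
    return psi + lr * reward * grad_log_p

# Toy usage: pretend the dev objective only improves when language 2 is sampled.
psi = np.zeros(4)                        # 4 training languages, uniform at start
for step in range(200):
    i = sample_language(psi)
    reward = 1.0 if i == 2 else 0.0      # stand-in for averaged dev improvement
    psi = reinforce_update(psi, i, reward)
print(np.round(softmax(psi), 3))         # sampling shifts toward language 2
```

Roughly speaking, in MultiDDS the reward for the sampled language reflects how much the resulting model update helps the dev objective aggregated over all languages (the step-ahead reward of Table 5); the constant reward above is purely illustrative.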
Results
  • The authors first show the average BLEU score over all languages for each translation setting in Tab. 1.
  • The authors can see that MultiDDS outperforms the best baseline in three of the four settings and is comparable to proportional sampling in the last M2O-Diverse setting.
  • MultiDDS-S consistently delivers better overall performance than the best baseline, and outperforms MultiDDS in three settings.
  • From these results, the authors can conclude that MultiDDS-S provides a stable strategy to train multilingual systems over a variety of settings.
Conclusion
  • The authors propose MultiDDS, an algorithm that learns a language scorer to optimize multilingual data usage to achieve good performance on many different languages.
  • The authors extend and improve over previous work on DDS (Wang et al, 2019b), with a more efficient algorithmic instantiation tailored for the multilingual training problem and a stable reward to optimize multiple objectives.
  • MultiDDS outperforms prior methods in terms of overall performance on all languages, and provides a flexible framework to prioritize different multilingual objectives.
  • There are other conceivable multilingual optimization objectives than those the authors explored in § 6.4 (an illustrative sketch of such objective weightings follows below).
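As one hedged illustration of what "prioritizing different multilingual objectives" could mean in practice, the snippet below aggregates per-language dev scores under different weighting schemes; the schemes, names, and numbers are assumptions for illustration, not the exact variants studied in § 6.4.

```python
import numpy as np

# Hypothetical illustration (weighting schemes are assumptions, not necessarily
# the variants of Section 6.4): the aggregate objective a scorer optimizes can
# weight per-language dev performance differently, e.g. treating all languages
# equally versus emphasizing the currently worst-performing ones.

def aggregate(dev_scores, mode="average"):
    """Combine per-language dev scores (higher is better) into one objective."""
    s = np.asarray(dev_scores, dtype=float)
    if mode == "average":            # all languages weighted equally
        w = np.ones_like(s)
    elif mode == "favor_low":        # emphasize languages with low current scores
        w = 1.0 / (s + 1e-6)
    elif mode == "favor_high":       # emphasize languages that already do well
        w = s.copy()
    else:
        raise ValueError(mode)
    w = w / w.sum()
    return float((w * s).sum())

dev_bleu = [28.4, 21.7, 9.3, 15.0]   # made-up per-language dev BLEU
for mode in ("average", "favor_low", "favor_high"):
    print(mode, round(aggregate(dev_bleu, mode), 2))
```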
Summary
  • Introduction:

    Multilingual models are trained to process different languages in a single model, and have been applied to a wide variety of NLP tasks such as text classification (Klementiev et al, 2012; Chen et al, 2018a), syntactic analysis (Plank et al, 2016; Ammar et al, 2016), named-entity recognition (Xie et al, 2018; Wu and Dredze, 2019), and machine translation (MT) (Dong et al, 2015; Johnson et al, 2016).
  • This is especially the case for modestly-sized models that are conducive to efficient deployment (Arivazhagan et al, 2019; Conneau et al, 2019).
  • Methods:

    MultiDDS directly parameterizes the standard dataset sampling distribution for multilingual training with ψ: $P_D(i; \psi) = e^{\psi_i} / \sum_{k=1}^{n} e^{\psi_k}$ (Eq. 8).
  • Unlike standard DDS, the authors make the design decision to weight training datasets rather than score each training example (x, y) directly, as this is more efficient and likely easier to learn.
  • The gains for the Related group are larger than for the Diverse group, likely because MultiDDS can take better advantage of language similarities than the baseline methods.
  • Results:

    The authors first show the average BLEU score over all languages for each translation setting in Tab. 1.
  • The authors can see that MultiDDS outperforms the best baseline in three of the four settings and is comparable to proportional sampling in the last M2O-Diverse setting.
  • MultiDDS-S consistently delivers better overall performance than the best baseline, and outperforms MultiDDS in three settings.
  • From these results, the authors can conclude that MultiDDS-S provides a stable strategy to train multilingual systems over a variety of settings.
  • Conclusion:

    The authors propose MultiDDS, an algorithm that learns a language scorer to optimize multilingual data usage to achieve good performance on many different languages.
  • The authors extend and improve over previous work on DDS (Wang et al, 2019b), with a more efficient algorithmic instantiation tailored for the multilingual training problem and a stable reward to optimize multiple objectives.
  • MultiDDS outperforms prior methods in terms of overall performance on all languages, and provides a flexible framework to prioritize different multilingual objectives.
  • There are other conceivable multilingual optimization objectives than those the authors explored in § 6.4.
Tables
  • Table1: Average BLEU for the baselines and our methods. Bold indicates the highest value
  • Table2: BLEU scores of the best baseline and MultiDDS-S for all translation settings. MultiDDS-S performs better on more languages. For each setting, bold indicates the highest value, and ∗ means the gains are statistically significant with p < 0.05
  • Table3: Average BLEU of the best baseline and three MultiDDS-S settings for the Diverse group. MultiDDS-S always outperforms the baseline
  • Table4: Mean and variance of the average BLEU score for the Diverse group. The models trained with MultiDDS-S perform better and have less variance
  • Table5: Average BLEU for the Related language group. The step-ahead reward proposed in the paper is better than or comparable to the moving average, and both are better than the baseline
  • Table6: Statistics of the related language group
  • Table7: Statistics of the diverse language group
  • Table8: BLEU score of the baselines and our method on the Related language group for many-to-one translation
  • Table9: BLEU score of the baselines and our method on the Diverse language group for many-to-one translation
  • Table10: BLEU score of the baselines and our method on the Related language group for one-to-many translation
  • Table11: BLEU score of the baselines and our method on the Diverse language group for one-to-many translation
Related work
  • Our work is related to multilingual training methods in general. Multilingual training has a rich history (Schultz and Waibel, 1998; Mimno et al, 2009; Shi et al, 2010; Tackstrom et al, 2013), but has become particularly prominent in recent years due to the ability of neural networks to easily perform multi-task learning (Dong et al, 2015; Plank et al, 2016; Johnson et al, 2016). As stated previously, recent results have demonstrated the importance of balancing HRLs and LRLs during multilingual training (Arivazhagan et al, 2019; Conneau et al, 2019), which is largely done with heuristic sampling using a temperature term; MultiDDS provides a more effective and less heuristic method. Wang and Neubig (2019) and Lin et al (2019) choose languages from multilingual data to improve the performance on a particular language, while our work instead aims to train a single model that handles translation between many languages. Zaremoodi et al (2018) and Wang et al (2018, 2019a) propose improvements to the model architecture to improve multilingual performance, while MultiDDS is model-agnostic and optimizes multilingual data usage.

    [Figure: per-language comparison of multDDS (variance 0.0012–0.0026) and multDDS-S (variance 0.0003–0.0007) over kor, bul, mkd, mar, fra, ell, hin, bos.]

    Our work is also related to machine learning methods that balance multitask learning (Chen et al, 2018b; Kendall et al, 2018). For example, Kendall et al (2018) propose to weight the training loss from a multitask model based on the uncertainty of each task. Our method focuses on optimizing multilingual data usage, and is both somewhat orthogonal to and less heuristic than such loss weighting methods. Finally, our work is related to meta-learning, which is used in hyperparameter optimization (Baydin et al, 2018), model initialization for fast adaptation (Finn et al, 2017), and data weighting (Ren et al, 2018). Notably, Gu et al (2018) apply meta-learning to learn an NMT model initialization for a set of languages, so that it can be quickly fine-tuned for any language. This differs in motivation from our method because it requires an adapted model for each language, while our method aims to optimize a single model to support all languages. To our knowledge, our work is the first to apply meta-learning to optimize data usage for multilingual objectives.
Funding
  • The first author is supported by a research grant from the Tang Family Foundation
  • This work was supported in part by NSF grant IIS-1812327
Reference
  • Roee Aharoni, Melvin Johnson, and Orhan Firat. 2019. Massively multilingual neural machine translation. In NAACL.
  • Waleed Ammar, George Mulcaire, Miguel Ballesteros, Chris Dyer, and Noah A. Smith. 2016. Many languages, one parser. TACL, 4:431–444.
  • Naveen Arivazhagan, Ankur Bapna, Orhan Firat, Dmitry Lepikhin, Melvin Johnson, Maxim Krikun, Mia Xu Chen, Yuan Cao, George Foster, Colin Cherry, Wolfgang Macherey, Zhifeng Chen, and Yonghui Wu. 2019. Massively multilingual neural machine translation in the wild: Findings and challenges. arXiv preprint.
  • Alexandre Klementiev, Ivan Titov, and Binod Bhattarai. 2012. Inducing crosslingual distributed representations of words. In COLING, pages 1459–1474.
  • Taku Kudo and John Richardson. 2018. SentencePiece: A simple and language independent subword tokenizer and detokenizer for neural text processing. In EMNLP.
  • Yu-Hsiang Lin, Chian-Yu Chen, Jean Lee, Zirui Li, Yuyan Zhang, Mengzhou Xia, Shruti Rijhwani, Junxian He, Zhisong Zhang, Xuezhe Ma, Antonios Anastasopoulos, Patrick Littell, and Graham Neubig. 2019. Choosing transfer languages for cross-lingual learning. In ACL.
  • Chaitanya Malaviya, Graham Neubig, and Patrick Littell. 2017. Learning language representations for typology prediction. In EMNLP.
  • David M. Mimno, Hanna M. Wallach, Jason Naradowsky, David A. Smith, and Andrew McCallum. 2009. Polylingual topic models. In EMNLP.
  • Graham Neubig and Junjie Hu. 2018. Rapid adaptation of neural machine translation to new languages. In EMNLP.
  • Toan Q. Nguyen and David Chiang. 2018. Transfer learning across low-resource, related languages for neural machine translation. In NAACL.
  • Toan Q. Nguyen and Julian Salazar. 2019. Transformers without tears: Improving the normalization of self-attention. In IWSLT.
  • Robert Östling and Jörg Tiedemann. 2017. Continuous multilinguality with language vectors. In EACL.
  • Myle Ott, Sergey Edunov, Alexei Baevski, Angela Fan, Sam Gross, Nathan Ng, David Grangier, and Michael Auli. 2019. fairseq: A fast, extensible toolkit for sequence modeling. In NAACL: Demonstrations.
  • Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. 2002. BLEU: a method for automatic evaluation of machine translation. In ACL.
  • Barbara Plank, Anders Søgaard, and Yoav Goldberg. 2016. Multilingual part-of-speech tagging with bidirectional long short-term memory models and auxiliary loss. In ACL.
  • Matt Post. 2018. A call for clarity in reporting BLEU scores. In WMT.
  • Ye Qi, Devendra Singh Sachan, Matthieu Felix, Sarguna Padmanabhan, and Graham Neubig. 2018. When and why are pre-trained word embeddings useful for neural machine translation? In NAACL.
  • Mengye Ren, Wenyuan Zeng, Bin Yang, and Raquel Urtasun. 2018. Learning to reweight examples for robust deep learning. In ICML.
  • Tanja Schultz and Alex Waibel. 1998. Multilingual and crosslingual speech recognition. In Proc. DARPA Workshop on Broadcast News Transcription and Understanding.
  • Lei Shi, Rada Mihalcea, and Mingjun Tian. 2010. Cross language text classification by model translation and semi-supervised learning. In EMNLP.
  • Oscar Täckström, Dipanjan Das, Slav Petrov, Ryan McDonald, and Joakim Nivre. 2013. Token and type constraints for cross-lingual part-of-speech tagging. TACL.
  • Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In NIPS.
  • Xinyi Wang and Graham Neubig. 2019. Target conditioned sampling: Optimizing data selection for multilingual neural machine translation. In ACL.
  • Xinyi Wang, Hieu Pham, Philip Arthur, and Graham Neubig. 2019a. Multilingual neural machine translation with soft decoupled encoding. In ICLR.
  • Xinyi Wang, Hieu Pham, Paul Michel, Antonios Anastasopoulos, Jaime Carbonell, and Graham Neubig. 2019b. Optimizing data usage via differentiable rewards. arXiv preprint.
  • Yining Wang, Jiajun Zhang, Feifei Zhai, Jingfang Xu, and Chengqing Zong. 2018. Three strategies to improve one-to-many multilingual translation. In EMNLP.
  • Ronald J. Williams. 1992. Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine Learning.
  • Shijie Wu and Mark Dredze. 2019. Beto, Bentz, Becas: The surprising cross-lingual effectiveness of BERT. In EMNLP.
  • Jiateng Xie, Zhilin Yang, Graham Neubig, Noah A. Smith, and Jaime Carbonell. 2018. Neural cross-lingual named entity recognition with minimal resources. In EMNLP.
  • Poorya Zaremoodi, Wray L. Buntine, and Gholamreza Haffari. 2018. Adaptive knowledge sharing in multitask learning: Improving low-resource neural machine translation. In ACL.
  • Barret Zoph, Deniz Yuret, Jonathan May, and Kevin Knight. 2016. Transfer learning for low-resource neural machine translation. In EMNLP.