Multilingual Neural Machine Translation With Soft Decoupled Encoding

ICLR, 2019.

Keywords:
parameter efficient; semantic meaning; strong multilingual nmt baseline; Low Resource Languages for Emergent Incidents; multilingual neural machine translation

Abstract:

Multilingual training of neural machine translation (NMT) systems has led to impressive accuracy improvements on low-resource languages. However, there are still significant challenges in efficiently learning word representations in the face of paucity of data. In this paper, we propose Soft Decoupled Encoding (SDE), a multilingual lexicon representation framework that obviates the need for segmentation by representing words on a full-word level, while still sharing parameters intelligently to aid cross-lingual generalization.

Introduction
  • Multilingual Neural Machine Translation (NMT) has shown great potential both in creating parameter-efficient MT systems for many languages (Johnson et al, 2016), and in improving translation quality of low-resource languages (Zoph et al, 2016; Firat et al, 2016; Gu et al, 2018; Neubig & Hu, 2018; Nguyen & Chiang, 2018).
  • The standard sequence-to-sequence NMT model (Sutskever et al, 2014) represents each lexical unit by a vector from a look-up table, making it difficult to share representations across different languages with limited lexicon overlap (a toy sketch of this lookup representation follows this list).
  • This problem is salient when translating low-resource languages, where there is not sufficient data to fully train the word embeddings.
  • The problem is especially acute when the high-resource language dominates the training data.
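To make the lookup-table limitation concrete, here is a minimal, hypothetical sketch (not taken from the paper or its code) of the standard embedding lookup. The toy vocabulary and word forms are illustrative assumptions: every surface form indexes an independent row, so similar spellings across languages share no parameters and unseen variants fall back to the unknown token.

```python
import torch
import torch.nn as nn

# Hypothetical joint vocabulary over related languages; the words are illustrative.
joint_vocab = {"<unk>": 0, "state": 1, "estado": 2, "stato": 3}
embed = nn.Embedding(num_embeddings=len(joint_vocab), embedding_dim=256)

def embed_word(word):
    # Each surface form owns an independent row: "estado" and the near-identical
    # "stato" share no parameters, and an unseen low-resource spelling variant
    # collapses to the single <unk> row.
    idx = joint_vocab.get(word, joint_vocab["<unk>"])
    return embed(torch.tensor([idx]))
```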
Highlights
  • Multilingual Neural Machine Translation (NMT) has shown great potential both in creating parameter-efficient MT systems for many languages (Johnson et al, 2016), and in improving translation quality of low-resource languages (Zoph et al, 2016; Firat et al, 2016; Gu et al, 2018; Neubig & Hu, 2018; Nguyen & Chiang, 2018)
  • Despite the success of multilingual Neural Machine Translation, it remains a research question how to represent the words from multiple languages in a way that is both parameter efficient and conducive to cross-lingual generalization
  • We propose Soft Decoupled Encoding (SDE), a multilingual lexicon representation framework that obviates the need for segmentation by representing words on a full-word level, but can share parameters intelligently, aiding generalization
  • Sub-joint is worse than sub-sep even though it allows complete sharing of lexical units between languages, probably because sub-joint leads to over-segmentation for the low-resource language
  • We show that Soft Decoupled Encoding can intelligently leverage the word similarities between two related languages by softly decoupling the lexical and semantic representations of the words (a minimal sketch of such an encoder follows this list)
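The bullets above describe SDE only at a high level: full-word representation, no segmentation, and a soft decoupling of lexical and semantic information shared across languages. The sketch below is one plausible reading of that description, assuming a bag-of-character-n-grams lexical encoder, a language-specific transform, and attention over a latent embedding matrix shared by all languages; the module names, sizes, and exact composition are illustrative assumptions rather than the authors' implementation.

```python
import torch
import torch.nn as nn

def char_ngrams(word, n_values=(1, 2, 3, 4)):
    # Character n-grams of a word, e.g. "cat" -> ["c", "a", "t", "ca", "at", "cat"].
    return [word[i:i + n] for n in n_values for i in range(len(word) - n + 1)]

class SoftDecoupledEmbedding(nn.Module):
    def __init__(self, ngram_vocab, langs, d_model=256, n_latent=1000):
        super().__init__()
        self.ngram_vocab = ngram_vocab                          # n-gram string -> row index
        self.ngram_emb = nn.EmbeddingBag(len(ngram_vocab), d_model, mode="sum")
        # Lightweight language-specific view of the character-level encoding.
        self.lang_transform = nn.ModuleDict({l: nn.Linear(d_model, d_model) for l in langs})
        # Latent semantic embeddings shared by all languages.
        self.latent = nn.Parameter(torch.randn(n_latent, d_model) * 0.02)

    def forward(self, word, lang):
        ids = [self.ngram_vocab[g] for g in char_ngrams(word) if g in self.ngram_vocab]
        c_lex = self.ngram_emb(torch.tensor([ids]))              # lexical: bag of char n-grams
        c_lang = torch.tanh(self.lang_transform[lang](c_lex))    # language-specific transform
        attn = torch.softmax(c_lang @ self.latent.T, dim=-1)     # query the shared latent space
        e_sem = attn @ self.latent                               # shared semantic embedding
        return e_sem + c_lang                                    # residual "soft" combination
```

Under this reading, spelling-similar words in related languages (such as the glg-por pairs in Table 9) share most of their character n-gram parameters, while the shared latent matrix also lets words with dissimilar spellings share semantic parameters.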
Methods
  • Before the authors describe the specific architecture in detail (Section 3.2), given the desiderata discussed above, they summarize existing methods for lexical representation in multilingual NMT (Table 1).
Results
  • Table 3 presents the results of SDE and of other baselines.
  • Of the three baselines using lookup, sub-sep achieves the best performance on three of the four languages.
  • The authors' reimplementation of the universal encoder (Gu et al, 2018) does not perform well either, probably because the monolingual embedding is not trained on enough data, or because the hyperparameters of their method are harder to tune.
  • SDE outperforms the best baselines for all four languages, without using subword units or extra monolingual data
Conclusion
  • Existing methods of lexical representation for multilingual NMT hinder parameter sharing between words that share similar surface forms and/or semantic meanings.
  • Acknowledgements: The authors thank David Mortensen for helpful comments, and Amazon for providing GPU credits.
  • This material is based upon work supported in part by the Defense Advanced Research Projects Agency Information Innovation Office (I2O) Low Resource Languages for Emergent Incidents (LORELEI) program under Contract No HR0011-15-C0114.
  • The U.S. Government is authorized to reproduce and distribute reprints for Government purposes notwithstanding any copyright notation hereon.
Tables
  • Table1: Methods for lexical representation in multilingual NMT
  • Table2: Table 2
  • Table3: BLEU scores on four language pairs. Statistical significance is indicated with ∗ (p < 0.0001) and † (p < 0.05), compared with the best baseline. (A generic sketch of how such paired significance tests are commonly computed follows this table list.)
  • Table4: BLEU scores after removing each component from SDE-com. Statistical significance is indicated with ∗ (p < 0.0001) and † (p < 0.005), compared with the full model in the first row
  • Table5: BLEU scores on four language pairs. Statistical significance is indicated with ∗ (p < 0.0001) and † (p < 0.005), compared with the setting in row 1
  • Table6: BLEU scores for training with all four high-resource languages. SDE achieves the best result on bel, with around 3 BLEU over the best baseline. The performance of sub-sep, on the other hand, decreases by around 1.5 BLEU when training on all languages for bel. The performance of both methods decreases for aze when using all languages: SDE loses only about 0.1 BLEU while sub-sep loses over 3 BLEU
  • Table7: Examples of glg to eng translations
  • Table8: Bilingual word pairs and their subword pieces
  • Table9: Words in glg-por that have the same meaning but different spelling, or similar spelling but different meaning
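The significance marks in Tables 3-5 presumably come from a paired significance test over BLEU; the reference list includes Clark et al. (2011) on hypothesis testing for MT. As a generic illustration only (not the authors' exact procedure), paired bootstrap resampling over test sentences can be sketched as follows; the corpus_bleu argument stands for any corpus-level BLEU function and is assumed to be supplied by the caller.

```python
import random

def paired_bootstrap(sys_a, sys_b, refs, corpus_bleu, n_samples=1000, seed=0):
    # Paired bootstrap resampling: repeatedly resample test sentences with
    # replacement and count how often system B scores higher than system A.
    # corpus_bleu(hypotheses, references) -> float is assumed to be provided.
    rng = random.Random(seed)
    idx = list(range(len(refs)))
    wins_b = 0
    for _ in range(n_samples):
        sample = [rng.choice(idx) for _ in idx]
        a = corpus_bleu([sys_a[i] for i in sample], [refs[i] for i in sample])
        b = corpus_bleu([sys_b[i] for i in sample], [refs[i] for i in sample])
        wins_b += b > a
    # The estimated p-value for "B is not better than A" is 1 - wins_b / n_samples.
    return wins_b / n_samples
```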
Funding
  • Proposes Soft Decoupled Encoding , a multilingual lexicon representation framework that obviates the need for segmentation by representing words on a full-word level, but can share parameters intelligently, aiding generalization
  • Finds in experiments in Section 4 that the universal encoder of Gu et al. (2018) is less robust than simple lookup when large monolingual data to pre-train embeddings is not available, which is the case for many low-resource languages
Reference
  • Duygu Ataman and Marcello Federico. Compositional representation of morphologically-rich input for neural machine translation. ACL, 2018.
  • Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. Neural machine translation by jointly learning to align and translate. In ICLR, 2015.
  • Daniel Chandler. Semiotic: The Basics. 2007.
  • Colin Cherry, George Foster, Ankur Bapna, Orhan Firat, and Wolfgang Macherey. Revisiting character-based neural machine translation with capacity and compression. CoRR, 2018.
  • Jonathan Clark, Chris Dyer, Alon Lavie, and Noah Smith. Better hypothesis testing for statistical machine translation: Controlling for optimizer instability. In ACL, 2011.
  • Chris Dyer, Victor Chahuneau, and Noah A. Smith. A simple, fast, and effective reparameterization of IBM model 2. NAACL, 2013.
  • Orhan Firat, Kyunghyun Cho, and Yoshua Bengio. Multi-way, multilingual neural machine translation with a shared attention mechanism. NAACL, 2016.
  • Algirdas Julien Greimas. Structural semantics: An attempt at a method. University of Nebraska Press, 1983.
  • Jiatao Gu, Hany Hassan, Jacob Devlin, and Victor O. K. Li. Universal neural machine translation for extremely low resource languages. NAACL, 2018.
  • Melvin Johnson, Mike Schuster, Quoc V. Le, Maxim Krikun, Yonghui Wu, Zhifeng Chen, Nikhil Thorat, Fernanda Viegas, Martin Wattenberg, Greg Corrado, Macduff Hughes, and Jeffrey Dean. Google’s multilingual neural machine translation system: Enabling zero-shot translation. TACL, 2016.
  • Rafal Jozefowicz, Oriol Vinyals, Mike Schuster, Noam Shazeer, and Yonghui Wu. Exploring the limits of language modeling. Arxiv, 2016.
  • Yoon Kim, Yacine Jernite, David Sontag, and Alexander M. Rush. Character-aware neural language models. AAAI, 2016.
  • Taku Kudo. Subword regularization: Improving neural network translation models with multiple subword candidates. ACL, 2018.
  • Jason Lee, Kyunghyun Cho, and Thomas Hofmann. Fully character-level neural machine translation without explicit segmentation. TACL, 2017.
  • Minh-Thang Luong, Hieu Pham, and Christopher D. Manning. Effective approaches to attentionbased neural machine translation. In EMNLP, 2015.
  • Graham Neubig and Junjie Hu. Rapid adaptation of neural machine translation to new languages. EMNLP, 2018.
  • Toan Q. Nguyen and David Chiang. Transfer learning across low-resource, related languages for neural machine translation. In NAACL, 2018.
  • Ye Qi, Devendra Singh Sachan, Matthieu Felix, Sarguna Padmanabhan, and Graham Neubig. When and why are pre-trained word embeddings useful for neural machine translation? NAACL, 2018.
  • Rico Sennrich, Barry Haddow, and Alexandra Birch. Neural machine translation of rare words with subword units. In ACL, 2016.
  • Ilya Sutskever, Oriol Vinyals, and Quoc V. Le. Sequence to sequence learning with neural networks. In NIPS, 2014.
  • L.J.P. van der Maaten and G.E. Hinton. Visualizing high-dimensional data using t-SNE. Journal of Machine Learning Research, 2008.
  • Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, and Illia Polosukhin. Attention is all you need. In NIPS, 2017.
  • John Wieting, Mohit Bansal, Kevin Gimpel, and Karen Livescu. Charagram: Embedding words and sentences via character n-grams. EMNLP, 2016.
  • Wen-tau Yih, Xiaodong He, and Christopher Meek. Semantic parsing for single-relation question answering. ACL, 2014.
  • Barret Zoph, Deniz Yuret, Jonathan May, and Kevin Knight. Transfer learning for low resource neural machine translation. EMNLP, 2016.