Multilingual Neural Machine Translation With Soft Decoupled Encoding
ICLR, 2019.
Keywords:
parameter efficient, semantic meaning, strong multilingual nmt baseline, Low Resource Languages for Emergent Incidents, multilingual neural machine translation
Abstract:
Multilingual training of neural machine translation (NMT) systems has led to impressive accuracy improvements on low-resource languages. However, there are still significant challenges in efficiently learning word representations in the face of paucity of data. In this paper, we propose Soft Decoupled Encoding (SDE), a multilingual lexicon representation framework that represents words on a full-word level, obviating the need for segmentation, while intelligently sharing lexical and semantic information across languages.
Introduction
- Multilingual Neural Machine Translation (NMT) has shown great potential both in creating parameter-efficient MT systems for many languages (Johnson et al, 2016), and in improving translation quality of low-resource languages (Zoph et al, 2016; Firat et al, 2016; Gu et al, 2018; Neubig & Hu, 2018; Nguyen & Chiang, 2018).
- The standard sequence-to-sequence NMT model (Sutskever et al, 2014) represents each lexical unit by a vector from a look-up table, making it difficult to share across different languages with limited lexicon overlap (illustrated concretely in the sketch after this list).
- This problem is salient when translating low-resource languages, where there is not sufficient data to fully train the word embeddings.
- It is especially salient when the high-resource language dominates the training data.
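To make the lexicon-sharing problem concrete, below is a minimal sketch (illustrative only; the word pairs and the n-gram extractor are assumptions, not taken from the paper) contrasting full-word lookup overlap with character n-gram overlap between words of two related languages:

```python
# Illustrative only: a simple character n-gram extractor and hypothetical
# word pairs from two related languages, showing that similarly spelled words
# share many n-grams even when their full forms (lookup-table entries) differ.

def char_ngrams(word, n_max=4):
    """Return the set of character n-grams (n = 1..n_max) of a word."""
    return {word[i:i + n]
            for n in range(1, n_max + 1)
            for i in range(len(word) - n + 1)}

# Hypothetical related-language word pairs (not taken from the paper).
pairs = [("estudante", "estudante"),  # identical form: a lookup table can share it
         ("nación", "nação"),         # similar spelling: separate lookup entries
         ("cidade", "cidade")]

for w1, w2 in pairs:
    shared = char_ngrams(w1) & char_ngrams(w2)
    union = char_ngrams(w1) | char_ngrams(w2)
    print(f"{w1} / {w2}: same full word = {w1 == w2}, "
          f"shared n-grams = {len(shared)}/{len(union)}")
```

Words with similar but non-identical spelling receive entirely separate lookup-table entries, yet share most of their character n-grams, which is the signal a spelling-aware representation can exploit.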
Highlights
- Multilingual Neural Machine Translation (NMT) has shown great potential both in creating parameter-efficient MT systems for many languages (Johnson et al, 2016), and in improving translation quality of low-resource languages (Zoph et al, 2016; Firat et al, 2016; Gu et al, 2018; Neubig & Hu, 2018; Nguyen & Chiang, 2018)
- Despite the success of multilingual Neural Machine Translation, it remains a research question how to represent the words from multiple languages in a way that is both parameter efficient and conducive to cross-lingual generalization
- We propose Soft Decoupled Encoding (SDE), a multilingual lexicon representation framework that obviates the need for segmentation by representing words on a full-word level, but can share parameters intelligently, aiding generalization
- Sub-joint is worse than sub-sep even though it allows complete sharing of lexical units between languages, probably because sub-joint leads to over-segmentation of the low-resource language
- We show that Soft Decoupled Encoding can intelligently leverage the word similarities between two related languages by softly decoupling the lexical and semantic representations of the words
- Acknowledgements: The authors thank David Mortensen for helpful comments, and Amazon for providing GPU credits. This material is based upon work supported in part by the Defense Advanced Research Projects Agency Information Innovation Office (I2O) Low Resource Languages for Emergent Incidents (LORELEI) program under Contract No. HR0011-15-C0114
Methods
- Before describing the specific architecture in detail (Section 3.2), and given the desiderata discussed above, the authors first summarize existing methods for lexical representation in multilingual NMT (Table 1).
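As a rough illustration of the soft decoupling idea, here is a minimal sketch (heavily simplified, with hypothetical dimensions, no language-specific transforms, and an assumed hashing scheme for n-grams; the actual architecture is specified in Section 3.2 of the paper): a word's lexical embedding is built from its character n-grams, its semantic embedding is retrieved by attending over a latent embedding space shared across languages, and the two are combined.

```python
# Minimal sketch of an SDE-style word encoding (hypothetical sizes, simplified:
# no language-specific transform, hashing instead of an n-gram vocabulary).
import numpy as np

rng = np.random.default_rng(0)

D = 32            # embedding dimension (hypothetical)
V_NGRAM = 1000    # hashed character n-gram vocabulary size (hypothetical)
N_LATENT = 50     # number of shared latent semantic embeddings (hypothetical)

ngram_emb = rng.normal(size=(V_NGRAM, D))    # language-specific n-gram embeddings
latent_emb = rng.normal(size=(N_LATENT, D))  # latent space shared by all languages

def char_ngram_ids(word, n_max=4):
    # Collect all character n-grams (n = 1..n_max) and hash them into a fixed
    # vocabulary; a real implementation would use a learned n-gram vocabulary.
    grams = [word[i:i + n] for n in range(1, n_max + 1)
             for i in range(len(word) - n + 1)]
    return [hash(g) % V_NGRAM for g in grams]

def sde_embed(word):
    # 1) Lexical embedding: bag of character n-grams of the word.
    lex = np.tanh(ngram_emb[char_ngram_ids(word)].sum(axis=0))
    # 2) Latent semantic embedding: attention over the shared latent space.
    scores = latent_emb @ lex                      # (N_LATENT,)
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()                       # softmax attention weights
    sem = weights @ latent_emb                     # (D,)
    # 3) Combine the lexical and semantic views (residual-style).
    return lex + sem

print(sde_embed("estudante").shape)  # -> (32,)
```

Because the latent embeddings are shared across all languages while the n-gram embeddings capture spelling, related words from different languages can map to similar semantic representations without any pre-segmentation.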
Results
- Table 3 presents the results of SDE and of the other baselines.
- Among the three baselines using lookup embeddings, sub-sep achieves the best performance for three of the four languages.
- The authors' reimplementation of the universal encoder (Gu et al, 2018) does not perform well either, probably because the monolingual embedding is not trained on enough data, or because the hyperparameters of the method are harder to tune.
- SDE outperforms the best baselines for all four languages, without using subword units or extra monolingual data
Conclusion
- Existing methods of lexical representation for multilingual NMT hinder parameter sharing between words that share similar surface forms and/or semantic meanings.
- Acknowledgements: The authors thank David Mortensen for helpful comments, and Amazon for providing GPU credits.
- This material is based upon work supported in part by the Defense Advanced Research Projects Agency Information Innovation Office (I2O) Low Resource Languages for Emergent Incidents (LORELEI) program under Contract No. HR0011-15-C0114.
- The U.S. Government is authorized to reproduce and distribute reprints for Government purposes notwithstanding any copyright notation hereon.
Tables
- Table1: Methods for lexical representation in multilingual NMT
- Table2: Statistics of the datasets
- Table3: BLEU scores on four language pairs. Statistical significance is indicated with ∗ (p < 0.0001) and † (p < 0.05), compared with the best baseline
- Table4: BLEU scores after removing each component from SDE-com. Statistical significance is indicated with ∗ (p < 0.0001) and † (p < 0.005), compared with the full model in the first row
- Table5: BLEU scores on four language pairs. Statistical significance is indicated with ∗ (p < 0.0001) and † (p < 0.005), compared with the setting in row 1
- Table6: BLEU scores for training with all four high-resource languages. SDE achieves the best result on bel, with around 3 BLEU over the best baseline, while the performance of sub-sep decreases by around 1.5 BLEU when training on all languages for bel. For aze, the performance of both methods decreases when using all languages: SDE loses only 0.1 BLEU while sub-sep loses over 3 BLEU
- Table7: Examples of glg to eng translations
- Table8: Bilingual word pairs and their subword pieces
- Table9: Words in glg-por that have the same meaning but different spelling, or similar spelling but different meaning
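The significance markers in the captions above come from statistical hypothesis testing; the paper cites Clark et al. (2011) for this. As a generic illustration only (not necessarily the paper's exact procedure), a paired bootstrap test over hypothetical per-sentence scores of two systems could look like:

```python
# Generic paired bootstrap resampling over per-sentence scores of two systems.
# Illustrative only: the paper's tests follow Clark et al. (2011); scores here
# are made up, and real BLEU significance tests recompute corpus BLEU per sample.
import random

def paired_bootstrap(scores_a, scores_b, n_samples=1000, seed=0):
    """Fraction of bootstrap resamples in which system A outscores system B."""
    assert len(scores_a) == len(scores_b)
    rng = random.Random(seed)
    n, wins_a = len(scores_a), 0
    for _ in range(n_samples):
        idx = [rng.randrange(n) for _ in range(n)]  # resample sentence indices
        if sum(scores_a[i] for i in idx) > sum(scores_b[i] for i in idx):
            wins_a += 1
    return wins_a / n_samples

# Hypothetical per-sentence scores for a baseline and an improved system.
baseline = [0.21, 0.30, 0.18, 0.25, 0.27, 0.19, 0.33, 0.24]
improved = [0.23, 0.31, 0.22, 0.26, 0.30, 0.21, 0.34, 0.27]
print(f"improved > baseline in {paired_bootstrap(improved, baseline):.1%} of resamples")
```

In practice one would recompute corpus-level BLEU on each resampled set rather than averaging per-sentence scores; the sketch only conveys the resampling logic.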
Funding
- Proposes Soft Decoupled Encoding (SDE), a multilingual lexicon representation framework that obviates the need for segmentation by representing words on a full-word level, but can share parameters intelligently, aiding generalization
- Finds in experiments in Section 4 that the universal lexical representation of Gu et al. (2018), which relies on pre-trained embeddings, is less robust than simple lookup when large monolingual data to pre-train the embeddings is not available, which is the case for many low-resource languages
Reference
- Duygu Ataman and Marcello Federico. Compositional representation of morphologically-rich input for neural machine translation. ACL, 2018.
- Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. Neural machine translation by jointly learning to align and translate. In ICLR, 2015.
- Daniel Chandler. Semiotics: The Basics. 2007.
- Colin Cherry, George Foster, Ankur Bapna, Orhan Firat, and Wolfgang Macherey. Revisiting character-based neural machine translation with capacity and compression. CoRR, 2018.
- Jonathan Clark, Chris Dyer, Alon Lavie, and Noah Smith. Better hypothesis testing for statistical machine translation: Controlling for optimizer instability. In ACL, 2011.
- Chris Dyer, Victor Chahuneau, and Noah A. Smith. A simple, fast, and effective reparameterization of IBM model 2. NAACL, 2013.
- Orhan Firat, Kyunghyun Cho, and Yoshua Bengio. Multi-way, multilingual neural machine translation with a shared attention mechanism. NAACL, 2016.
- Algirdas Julien Greimas. Structural semantics: An attempt at a method. University of Nebraska Press, 1983.
- Jiatao Gu, Hany Hassan, Jacob Devlin, and Victor O. K. Li. Universal neural machine translation for extremely low resource languages. NAACL, 2018.
- Melvin Johnson, Mike Schuster, Quoc V. Le, Maxim Krikun, Yonghui Wu, Zhifeng Chen, Nikhil Thorat, Fernanda Viegas, Martin Wattenberg, Greg Corrado, Macduff Hughes, and Jeffrey Dean. Google’s multilingual neural machine translation system: Enabling zero-shot translation. TACL, 2016.
- Rafal Jozefowicz, Oriol Vinyals, Mike Schuster, Noam Shazeer, and Yonghui Wu. Exploring the limits of language modeling. arXiv, 2016.
- Yoon Kim, Yacine Jernite, David Sontag, and Alexander M. Rush. Character-aware neural language models. AAAI, 2016.
- Taku Kudo. Subword regularization: Improving neural network translation models with multiple subword candidates. ACL, 2018.
- Jason Lee, Kyunghyun Cho, and Thomas Hofmann. Fully character-level neural machine translation without explicit segmentation. TACL, 2017.
- Minh-Thang Luong, Hieu Pham, and Christopher D. Manning. Effective approaches to attention-based neural machine translation. In EMNLP, 2015.
- Graham Neubig and Junjie Hu. Rapid adaptation of neural machine translation to new languages. EMNLP, 2018.
- Toan Q. Nguyen and David Chiang. Transfer learning across low-resource, related languages for neural machine translation. In NAACL, 2018.
- Ye Qi, Devendra Singh Sachan, Matthieu Felix, Sarguna Padmanabhan, and Graham Neubig. When and why are pre-trained word embeddings useful for neural machine translation? NAACL, 2018.
- Rico Sennrich, Barry Haddow, and Alexandra Birch. Neural machine translation of rare words with subword units. In ACL, 2016.
- Ilya Sutskever, Oriol Vinyals, and Quoc V. Le. Sequence to sequence learning with neural networks. In NIPS, 2014.
- L.J.P. van der Maaten and G.E. Hinton. Visualizing high-dimensional data using t-SNE. Journal of Machine Learning Research, 2008.
- Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, and Illia Polosukhin. Attention is all you need. In NIPS, 2017.
- John Wieting, Mohit Bansal, Kevin Gimpel, and Karen Livescu. Charagram: Embedding words and sentences via character n-grams. EMNLP, 2016.
- Wen-tau Yih, Xiaodong He, and Christopher Meek. Semantic parsing for single-relation question answering. ACL, 2014.
- Barret Zoph, Deniz Yuret, Jonathan May, and Kevin Knight. Transfer learning for low resource neural machine translation. EMNLP, 2016.