Improving Massively Multilingual Neural Machine Translation and Zero-Shot Translation

Biao Zhang
Philip Williams

ACL, pp. 1628-1639, 2020.

Keywords:
NMT model, multilingual neural machine translation, multiple languages, zero-shot performance, neural machine translation

Abstract:

Massively multilingual models for neural machine translation (NMT) are theoretically attractive, but often underperform bilingual models and deliver poor zero-shot translations. In this paper, we explore ways to improve them. We argue that multilingual NMT requires stronger modeling capacity to support language pairs with varying typological characteristics, and we overcome this bottleneck by deepening the Transformer and devising language-aware neural components. We identify the off-target translation issue as a major source of the poor zero-shot performance and propose a random online backtranslation algorithm to correct it. Experiments on OPUS-100, a multilingual dataset drawn from OPUS covering 100 languages with around 55M sentence pairs, show that these approaches substantially improve translation quality, narrowing the gap with bilingual NMT models and pivot-based methods.
Introduction
Highlights
  • With the great success of neural machine translation (NMT) on bilingual datasets (Bahdanau et al., 2015; Vaswani et al., 2017; Barrault et al., 2019), there is renewed interest in multilingual translation, where a single NMT model is optimized for the translation of multiple language pairs (Firat et al., 2016a; Johnson et al., 2017; Lu et al., 2018; Aharoni et al., 2019)
  • This paper explores approaches to improve massively multilingual neural machine translation, especially on zero-shot translation
  • We show that multilingual neural machine translation suffers from weak capacity, and propose to enhance it by deepening the Transformer and devising language-aware neural models
  • We find that multilingual neural machine translation often generates off-target translations in zero-shot directions, and propose to correct this with a random online backtranslation algorithm (sketched after this list)
  • We empirically demonstrate the feasibility of backtranslation in massively multilingual settings, enabling massively zero-shot translation for the first time
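The random online backtranslation (ROBT) algorithm is only described at a high level on this page. The sketch below illustrates the core idea under stated assumptions; it is not the authors' implementation, and both the batch format and the `translate_fn` greedy-decoding helper are hypothetical interfaces.

```python
import random

def robt_augment(batch, target_languages, translate_fn):
    """Sketch of random online backtranslation (ROBT).

    For each training pair (src, tgt) in an English-centric batch, sample a
    language uniformly at random, back-translate the target sentence into that
    language with the *current* model, and add the synthetic pair
    (pseudo_src -> tgt). This injects (non-English -> non-English) training
    signal for zero-shot directions.

    `batch` is assumed to be a list of (src, tgt, tgt_lang) triples and
    `translate_fn(sentence, tgt_lang)` a greedy-decoding helper built on the
    current multilingual model.
    """
    synthetic = []
    for src, tgt, tgt_lang in batch:
        sampled_lang = random.choice(target_languages)        # uniform sampling over T
        pseudo_src = translate_fn(tgt, tgt_lang=sampled_lang)  # online back-translation
        synthetic.append((pseudo_src, tgt, tgt_lang))          # train pseudo_src -> tgt
    return batch + synthetic
```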
Methods
  • The authors perform one-to-many (English-X) and many-to-many (English-X ∪ X-English) translation on OPUS-100 (|T| = 100).
  • The authors randomly shuffle the training set to mix instances of different language pairs.
  • The authors adopt BLEU (Papineni et al., 2002) for translation evaluation with the SacreBLEU toolkit (Post, 2018).
  • Rather than providing numbers for each language pair, the authors report average BLEU over all 94 language pairs with test sets (BLEU94).
  • The authors also report the win ratio (WR), i.e. the proportion of language pairs on which the approach outperforms its baseline (see the sketch below)
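The aggregate numbers reported in the result tables (BLEU94 and WR) are straightforward to reproduce from per-pair scores. A minimal sketch, assuming per-language-pair corpus BLEU has already been computed (e.g. with SacreBLEU) and stored in dictionaries keyed by language pair:

```python
def average_bleu(bleu_by_pair):
    """BLEU94-style score: mean corpus BLEU over all evaluated language pairs."""
    return sum(bleu_by_pair.values()) / len(bleu_by_pair)

def win_ratio(system_bleu, baseline_bleu):
    """WR: percentage of language pairs on which the system beats its baseline."""
    pairs = system_bleu.keys() & baseline_bleu.keys()
    wins = sum(system_bleu[p] > baseline_bleu[p] for p in pairs)
    return 100.0 * wins / len(pairs)

# Hypothetical usage with made-up scores:
# system   = {"en-de": 27.1, "en-zh": 33.0}
# baseline = {"en-de": 26.5, "en-zh": 33.4}
# print(average_bleu(system), win_ratio(system, baseline))
```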
Results
  • Results on One-to-Many Translation: Table 2 summarizes the results. The inferior performance of multilingual NMT (3) against its bilingual counterpart (1) reflects the capacity issue (-1.95 BLEU4).
  • The authors' results in Table 6 reveal that zero-shot translation quality is rather poor (3.97 BLEUzero, (2) w/o ROBT) compared to the pivot-based bilingual baseline (12.98 BLEUzero, (1)) under the massively multilingual setting (Aharoni et al., 2019); translations into different target languages also show varied performance.
  • Results in Table 6 show a huge translation-language accuracy gap between the multilingual model and the pivot-based method (-48.83% ACCzero, (1) → (2), w/o ROBT), from which the authors conclude that the off-target translation issue is one source of the poor zero-shot performance. Translation language is identified with langdetect (https://github.com/Mimino666/langdetect); the accuracy computation is sketched after this list.
  • In other words, increasing the modeling capacity benefits zero-shot translation and improves robustness
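The translation-language accuracy (ACCzero) used to diagnose off-target translation can be approximated with the langdetect toolkit linked above. A minimal sketch, not the authors' exact evaluation script:

```python
from langdetect import DetectorFactory, detect
from langdetect.lang_detect_exception import LangDetectException

DetectorFactory.seed = 0  # make langdetect deterministic across runs

def translation_language_accuracy(hypotheses, expected_lang):
    """Percentage of hypotheses whose detected language matches the intended
    target language; lower values indicate more off-target translations."""
    hits = 0
    for hyp in hypotheses:
        try:
            hits += detect(hyp) == expected_lang
        except LangDetectException:
            pass  # empty or undetectable output is counted as off-target
    return 100.0 * hits / len(hypotheses)
```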
Conclusion
  • This paper explores approaches to improve massively multilingual NMT, especially on zero-shot translation.
  • The authors release OPUS-100, a multilingual dataset from OPUS including 100 languages with around 55M sentence pairs for future study.
  • The authors' experiments on this dataset show that the proposed approaches substantially increase translation performance, narrowing the performance gap with bilingual NMT models and pivot-based methods
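For reference, the pivot-based baseline mentioned here and in Table 6 translates through English in two supervised steps. A minimal sketch, reusing the hypothetical `translate_fn(sentence, tgt_lang)` helper assumed in the ROBT sketch above:

```python
def pivot_translate(translate_fn, sentence, tgt_lang, pivot_lang="en"):
    """Two-step pivot translation: source -> English -> target.

    Each step is a supervised direction (X -> English, English -> X),
    so no zero-shot inference is required.
    """
    english = translate_fn(sentence, tgt_lang=pivot_lang)   # X -> English
    return translate_fn(english, tgt_lang=tgt_lang)          # English -> target
```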
Summary
  • Introduction:

    With the great success of neural machine translation (NMT) on bilingual datasets (Bahdanau et al., 2015; Vaswani et al., 2017; Barrault et al., 2019), there is renewed interest in multilingual translation where a single NMT model is optimized for the translation of multiple language pairs (Firat et al., 2016a; Johnson et al., 2017; Lu et al., 2018; Aharoni et al., 2019).
  • Multilingual NMT eases model deployment and can encourage knowledge transfer among related language pairs (Lakew et al., 2018; Tan et al., 2019) and improve low-resource translation (Ha et al., 2016; Arivazhagan et al., 2019b)
Tables
  • Table1: Illustration of the off-target translation issue with French→German zero-shot translations with a multilingual NMT model. Our baseline multilingual NMT model often translates into the wrong language for zero-shot language pairs, such as copying the source sentence or translating into English rather than German
  • Table2: Test BLEU for one-to-many translation on OPUS-100 (100 languages). “Bilingual”: bilingual NMT, “L”: model depth (for both encoder and decoder), “#Param”: number of parameters, “WR”: win ratio (%) compared to ref (3), MATT: the merged attention (Zhang et al., 2019). LALN and LALT denote the proposed language-aware layer normalization and linear transformation, respectively (a rough sketch of both follows this list). “BLEU94/BLEU4”: average BLEU over all 94 translation directions in the test set and over En→De/Zh/Br/Te, respectively. Higher BLEU and WR indicate better results. Best scores are highlighted in bold
  • Table3: English→X test BLEU for many-to-many translation on OPUS-100 (100 languages). “WR”: win ratio (%) compared to ref ( 2 w/o ROBT). ROBT denotes the proposed random online backtranslation method
  • Table4: X→English test BLEU for many-to-many translation on OPUS-100 (100 languages). “WR”: win ratio (%) compared to ref ( 2 w/o ROBT)
  • Table5: Test BLEU for High/Medium/Low (High/Med/Low) resource language pairs in many-to-many setting on OPUS-100 (100 languages). We report average BLEU for each category
  • Table6: Test BLEU and translation-language accuracy for zero-shot translation in the many-to-many setting on OPUS-100 (100 languages). “BLEUzero/ACCzero”: average BLEU/accuracy over all zero-shot translation directions in the test set, “Pivot”: pivot-based translation that first translates the source sentence into English (X→English NMT) and then into the target language (English→X NMT). Lower accuracy indicates more severe off-target translation. The average Pearson correlation coefficient between language accuracy and the corresponding BLEU is 0.93 (significant at p < 0.01)
  • Table7: Zero-shot translation quality for ROBT under different settings. “100-to-100”: the setting used in the above experiments, with T set to all target languages. “6-to-6”: T only includes the zero-shot languages in the test set. We employ a 6-layer Transformer with LALN and LALT for these experiments
  • Table8: Numbers of training, validation, and test sentence pairs in the English-centric multilingual dataset
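Table 2 refers to the proposed language-aware layer normalization (LALN) and language-aware linear transformation (LALT). The exact parameterization is given in the paper; the rough PyTorch sketch below only illustrates the general idea, with all module names, shapes, and initialization choices assumed rather than taken from the authors' code.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class LanguageAwareLayerNorm(nn.Module):
    """Rough sketch of the LALN idea: layer normalization whose gain and bias
    are looked up per (target) language instead of being shared."""
    def __init__(self, num_langs, d_model, eps=1e-6):
        super().__init__()
        self.eps = eps
        self.gain = nn.Embedding(num_langs, d_model)   # one scale vector per language
        self.bias = nn.Embedding(num_langs, d_model)   # one offset vector per language
        nn.init.ones_(self.gain.weight)
        nn.init.zeros_(self.bias.weight)

    def forward(self, x, lang_id):
        # x: (batch, seq_len, d_model); lang_id: (batch,) long tensor of language indices
        normed = F.layer_norm(x, x.shape[-1:], eps=self.eps)
        return normed * self.gain(lang_id).unsqueeze(1) + self.bias(lang_id).unsqueeze(1)

class LanguageAwareLinear(nn.Module):
    """Rough sketch of the LALT idea: a language-specific linear projection
    applied to the encoder output before it is consumed by the decoder."""
    def __init__(self, num_langs, d_model):
        super().__init__()
        # One (d_model x d_model) matrix per language, initialized to identity.
        self.weight = nn.Parameter(
            torch.eye(d_model).unsqueeze(0).repeat(num_langs, 1, 1))

    def forward(self, enc_out, lang_id):
        # enc_out: (batch, seq_len, d_model); pick each sentence's target-language matrix.
        return torch.einsum('bsd,bde->bse', enc_out, self.weight[lang_id])
```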
Related work
  • Pioneering work on multilingual NMT began with multitask learning, which shared the encoder for one-to-many translation (Dong et al., 2015) or the attention mechanism for many-to-many translation (Firat et al., 2016a). These methods required a dedicated encoder or decoder for each language, limiting their scalability. By contrast, Lee et al. (2017) exploited character-level inputs and adopted a shared encoder for many-to-one translation. Ha et al. (2016) and Johnson et al. (2017) further successfully trained a single NMT model for multilingual translation with a target language symbol guiding the translation direction. This approach serves as our baseline. Still, this paradigm forces different languages into one joint representation space, neglecting their linguistic diversity. Several subsequent studies have explored strategies to mitigate this representation bottleneck, ranging from reorganizing parameter sharing (Blackwood et al., 2018; Sachan and Neubig, 2018; Lu et al., 2018; Wang et al., 2019c; Vázquez et al., 2019), designing language-specific parameter generators (Platanios et al., 2018), and decoupling multilingual word encodings (Wang et al., 2019b) to language clustering (Tan et al., 2019). Our language-specific modeling continues in this direction, but with a special focus on broadening normalization layers and encoder outputs.
Funding
  • This project has received funding from the European Union’s Horizon 2020 Research and Innovation Programme under Grant Agreements 825460 (ELITR) and 825299 (GoURMET)
  • This project has received support from Samsung Electronics Polska sp. z o.o. - Samsung R&D Institute Poland
  • Rico Sennrich acknowledges support of the Swiss National Science Foundation (MUTAMUR; no. 176727)
Reference
  • Roee Aharoni, Melvin Johnson, and Orhan Firat. 2019. Massively multilingual neural machine translation. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 3874–3884, Minneapolis, Minnesota. Association for Computational Linguistics.
  • Maruan Al-Shedivat and Ankur Parikh. 2019. Consistency by agreement in zero-shot neural machine translation. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 1184–1197, Minneapolis, Minnesota. Association for Computational Linguistics.
  • Naveen Arivazhagan, Ankur Bapna, Orhan Firat, Roee Aharoni, Melvin Johnson, and Wolfgang Macherey. 2019a. The missing ingredient in zero-shot neural machine translation. CoRR, abs/1903.07091.
  • Naveen Arivazhagan, Ankur Bapna, Orhan Firat, Dmitry Lepikhin, Melvin Johnson, Maxim Krikun, Mia Xu Chen, Yuan Cao, George Foster, Colin Cherry, Wolfgang Macherey, Zhifeng Chen, and Yonghui Wu. 2019b. Massively multilingual neural machine translation in the wild: Findings and challenges. CoRR, abs/1907.05019.
  • Jimmy Lei Ba, Jamie Ryan Kiros, and Geoffrey E Hinton. 2016. Layer normalization. arXiv preprint arXiv:1607.06450.
  • Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. 2015. Neural machine translation by jointly learning to align and translate. In 3rd International Conference on Learning Representations, ICLR 2015, San Diego, CA, USA, May 7-9, 2015, Conference Track Proceedings.
  • Ankur Bapna, Mia Chen, Orhan Firat, Yuan Cao, and Yonghui Wu. 2018. Training deeper neural machine translation models with transparent attention. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 3028–3033, Brussels, Belgium. Association for Computational Linguistics.
  • Ankur Bapna and Orhan Firat. 2019. Simple, scalable adaptation for neural machine translation. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 1538–1548, Hong Kong, China. Association for Computational Linguistics.
  • Loïc Barrault, Ondrej Bojar, Marta R. Costa-jussà, Christian Federmann, Mark Fishel, Yvette Graham, Barry Haddow, Matthias Huck, Philipp Koehn, Shervin Malmasi, Christof Monz, Mathias Müller, Santanu Pal, Matt Post, and Marcos Zampieri. 2019. Findings of the 2019 conference on machine translation (WMT19). In Proceedings of the Fourth Conference on Machine Translation (Volume 2: Shared Task Papers, Day 1), pages 1–61, Florence, Italy. Association for Computational Linguistics.
  • Graeme Blackwood, Miguel Ballesteros, and Todd Ward. 2018. Multilingual neural machine translation with task-specific attention. In Proceedings of the 27th International Conference on Computational Linguistics, pages 3112–3122, Santa Fe, New Mexico, USA. Association for Computational Linguistics.
  • Anna Currey and Kenneth Heafield. 2019. Zeroresource neural machine translation with monolingual pivot data. In Proceedings of the 3rd Workshop on Neural Generation and Translation, pages 99–107, Hong Kong. Association for Computational Linguistics.
  • Daxiang Dong, Hua Wu, Wei He, Dianhai Yu, and Haifeng Wang. 2015. Multi-task learning for multiple language translation. In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pages 1723–1732, Beijing, China. Association for Computational Linguistics.
  • Orhan Firat, Kyunghyun Cho, and Yoshua Bengio. 2016a. Multi-way, multilingual neural machine translation with a shared attention mechanism. In Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 866–875, San Diego, California. Association for Computational Linguistics.
  • Orhan Firat, Baskaran Sankaran, Yaser Al-onaizan, Fatos T. Yarman Vural, and Kyunghyun Cho. 2016b. Zero-resource translation with multi-lingual neural machine translation. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, pages 268–277, Austin, Texas. Association for Computational Linguistics.
  • Xavier García, Pierre Forêt, Thibault Sellam, and Ankur P. Parikh. 2020. A multilingual view of unsupervised machine translation. ArXiv, abs/2002.02955.
  • Jiatao Gu, Yong Wang, Kyunghyun Cho, and Victor O.K. Li. 2019. Improved zero-shot neural machine translation via ignoring spurious correlations. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 1258–1268, Florence, Italy. Association for Computational Linguistics.
  • Thanh-Le Ha, Jan Niehues, and Alexander Waibel. 2016. Toward multilingual neural machine translation with universal encoder and decoder. In Proceedings of the 13th International Workshop on Spoken Language Translation (IWSLT), Seattle, USA.
  • Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. 2015. Deep residual learning for image recognition. CoRR, abs/1512.03385.
  • Melvin Johnson, Mike Schuster, Quoc V. Le, Maxim Krikun, Yonghui Wu, Zhifeng Chen, Nikhil Thorat, Fernanda Viégas, Martin Wattenberg, Greg Corrado, Macduff Hughes, and Jeffrey Dean. 2017. Google’s multilingual neural machine translation system: Enabling zero-shot translation. Transactions of the Association for Computational Linguistics, 5:339–351.
  • Diederik P Kingma and Jimmy Ba. 2015. Adam: A method for stochastic optimization. In International Conference on Learning Representations.
  • Philipp Koehn. 2010. Statistical Machine Translation, 1st edition. Cambridge University Press, New York, NY, USA.
  • Taku Kudo and John Richardson. 2018. SentencePiece: A simple and language independent subword tokenizer and detokenizer for neural text processing. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, pages 66–71, Brussels, Belgium. Association for Computational Linguistics.
  • Surafel M. Lakew, Marcello Federico, Matteo Negri, and Marco Turchi. 2019. Multilingual Neural Machine Translation for Zero-Resource Languages. arXiv e-prints, page arXiv:1909.07342.
  • Surafel Melaku Lakew, Mauro Cettolo, and Marcello Federico. 2018. A comparison of transformer and recurrent neural networks on multilingual neural machine translation. In Proceedings of the 27th International Conference on Computational Linguistics, pages 641–652, Santa Fe, New Mexico, USA. Association for Computational Linguistics.
  • Jason Lee, Kyunghyun Cho, and Thomas Hofmann. 2017. Fully character-level neural machine translation without explicit segmentation. Transactions of the Association for Computational Linguistics, 5:365–378.
  • Yichao Lu, Phillip Keung, Faisal Ladhak, Vikas Bhardwaj, Shaonan Zhang, and Jason Sun. 2018. A neural interlingua for multilingual machine translation. In Proceedings of the Third Conference on Machine Translation: Research Papers, pages 84–92, Brussels, Belgium. Association for Computational Linguistics.
  • Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. 2002. Bleu: a method for automatic evaluation of machine translation. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, pages 311–318, Philadelphia, Pennsylvania, USA. Association for Computational Linguistics.
  • Emmanouil Antonios Platanios, Mrinmaya Sachan, Graham Neubig, and Tom Mitchell. 2018. Contextual parameter generation for universal neural machine translation. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 425–435, Brussels, Belgium. Association for Computational Linguistics.
  • Matt Post. 2018. A call for clarity in reporting BLEU scores. In Proceedings of the Third Conference on Machine Translation: Research Papers, pages 186–191, Brussels, Belgium. Association for Computational Linguistics.
  • Devendra Sachan and Graham Neubig. 2018. Parameter sharing methods for multilingual self-attentional translation models. In Proceedings of the Third Conference on Machine Translation: Research Papers, pages 261–271, Brussels, Belgium. Association for Computational Linguistics.
  • Rico Sennrich, Barry Haddow, and Alexandra Birch. 2016a. Improving neural machine translation models with monolingual data. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 86–96, Berlin, Germany. Association for Computational Linguistics.
  • Rico Sennrich, Barry Haddow, and Alexandra Birch. 2016b. Neural machine translation of rare words with subword units. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1715– 1725, Berlin, Germany. Association for Computational Linguistics.
  • Jinsong Su, Shan Wu, Deyi Xiong, Yaojie Lu, Xianpei Han, and Biao Zhang. 2018. Variational recurrent neural machine translation. In Thirty-Second AAAI Conference on Artificial Intelligence.
  • Xu Tan, Jiale Chen, Di He, Yingce Xia, Tao QIN, and Tie-Yan Liu. 2019. Multilingual neural machine translation with language clustering. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 963–973, Hong Kong, China. Association for Computational Linguistics.
  • Qiang Wang, Bei Li, Tong Xiao, Jingbo Zhu, Changliang Li, Derek F. Wong, and Lidia S. Chao. 2019a. Learning deep transformer models for machine translation. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 1810–1822, Florence, Italy. Association for Computational Linguistics.
  • Xinyi Wang, Hieu Pham, Philip Arthur, and Graham Neubig. 2019b. Multilingual neural machine translation with soft decoupled encoding. In International Conference on Learning Representations.
  • Yining Wang, Long Zhou, Jiajun Zhang, Feifei Zhai, Jingfang Xu, and Chengqing Zong. 2019c. A compact and language-sensitive multilingual translation method. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 1213–1223, Florence, Italy. Association for Computational Linguistics.
  • Biao Zhang, Ivan Titov, and Rico Sennrich. 2019. Improving deep transformer with depth-scaled initialization and merged attention. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 898–909, Hong Kong, China. Association for Computational Linguistics.
  • Biao Zhang, Deyi Xiong, Jinsong Su, Hong Duan, and Min Zhang. 2016. Variational neural machine translation. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, pages 521–530, Austin, Texas. Association for Computational Linguistics.
  • Zaixiang Zheng, Hao Zhou, Shujian Huang, Lei Li, Xin-Yu Dai, and Jiajun Chen. 2020. Mirrorgenerative neural machine translation. In International Conference on Learning Representations.
  • Jörg Tiedemann. 2012. Parallel data, tools and interfaces in opus. In Proceedings of the Eight International Conference on Language Resources and Evaluation (LREC’12), Istanbul, Turkey. European Language Resources Association (ELRA).
  • Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Advances in Neural Information Processing Systems 30, pages 5998–6008. Curran Associates, Inc.
  • Raúl Vázquez, Alessandro Raganato, Jörg Tiedemann, and Mathias Creutz. 2019. Multilingual NMT with a language-independent attention bridge. In Proceedings of the 4th Workshop on Representation Learning for NLP (RepL4NLP-2019), pages 33–39, Florence, Italy. Association for Computational Linguistics.