Effective Domain Mixing for Neural Machine Translation

    Denny Britz
    Reid Pryzant

    WMT, pp. 118-126, 2017.

    Keywords: Long Short-Term Memory, Neural Machine Translation, composite data, domain data, test time

    Abstract:

    Neural Machine Translation (NMT) models are often trained on heterogeneous mixtures of domains, from news to parliamentary proceedings, each with unique distributions and language. In this work we show that training NMT systems on naively mixed data can degrade performance versus models fit to each constituent domain. We demonstrate that this problem can be circumvented, and propose three models that do so by jointly learning domain discrimination and translation.

    Introduction
    • Neural Machine Translation (NMT) (Kalchbrenner and Blunsom, 2013; Sutskever et al., 2014; Cho et al., 2014) is an end-to-end approach for automated translation.
    • This setting, training a single translation model on multi-domain data so that it performs well on each constituent domain at test time, differs from the majority of work in domain adaptation, which explores how models trained on some source domain can be effectively applied to outside target domains.
    • This setting is important because previous research has shown that both standard NMT and adaptation methods degrade performance on the original source domain(s) (Farajian et al., 2017; Haddow and Koehn, 2012).
    • The authors seek to show that this problem can be overcome, and hypothesize that leveraging the heterogeneity of composite data, rather than dampening it, will allow them to do so.
    Highlights
    • Neural Machine Translation (NMT) (Kalchbrenner and Blunsom, 2013; Sutskever et al., 2014; Cho et al., 2014) is an end-to-end approach for automated translation.
    • Our problem space is as follows: how can we train a translation model on multi-domain data to improve test-time performance in each constituent domain? This setting differs from the majority of work in domain adaptation, which explores how models trained on some source domain can be effectively applied to outside target domains.
    • This setting is important because previous research has shown that both standard Neural Machine Translation and adaptation methods degrade performance on the original source domain(s) (Farajian et al., 2017; Haddow and Koehn, 2012).
    • We presented three novel models for applying Neural Machine Translation to multi-domain settings, and demonstrated their efficacy across six domains in three language pairs, in the process achieving a new state-of-the-art in EN-JA translation.
    • All the proposed approaches outperform naive combining of training data, so we advise practitioners to implement whichever best fits their existing pipelines, but an approach based on a discriminator network offered the most reliable results (a minimal sketch of this kind of joint training follows this list).
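    The paper's three models are not detailed in this summary, but the bullets above describe their shared idea: learn domain discrimination jointly with translation. The following is a minimal, hypothetical PyTorch sketch of one such setup, with a shared encoder feeding both a translation decoder and a domain classifier and a combined loss; the module sizes, names, and the weighting factor lambda_domain are illustrative assumptions, not the paper's exact architecture.

      import torch
      import torch.nn as nn

      class JointDomainNMT(nn.Module):
          """Hypothetical sketch: a shared encoder feeds a translation decoder and a domain classifier."""
          def __init__(self, vocab_size, hidden=512, num_domains=2):
              super().__init__()
              self.embed = nn.Embedding(vocab_size, hidden)
              self.encoder = nn.LSTM(hidden, hidden, batch_first=True)
              self.decoder = nn.LSTM(hidden, hidden, batch_first=True)
              self.generator = nn.Linear(hidden, vocab_size)     # predicts target-language tokens
              self.domain_head = nn.Linear(hidden, num_domains)  # predicts the source domain

          def forward(self, src, tgt_in):
              enc_out, state = self.encoder(self.embed(src))
              dec_out, _ = self.decoder(self.embed(tgt_in), state)
              translation_logits = self.generator(dec_out)
              domain_logits = self.domain_head(enc_out.mean(dim=1))  # mean-pooled encoder states
              return translation_logits, domain_logits

      def joint_loss(translation_logits, tgt_out, domain_logits, domain_labels, lambda_domain=0.1):
          """Translation cross-entropy plus a weighted domain-classification term."""
          ce = nn.functional.cross_entropy
          trans = ce(translation_logits.reshape(-1, translation_logits.size(-1)), tgt_out.reshape(-1))
          domain = ce(domain_logits, domain_labels)
          return trans + lambda_domain * domain

      # Tiny smoke test with random data (batch of 4, sequences of length 7).
      model = JointDomainNMT(vocab_size=32000)
      src = torch.randint(0, 32000, (4, 7))
      tgt = torch.randint(0, 32000, (4, 7))
      logits, dlogits = model(src, tgt[:, :-1])
      print(joint_loss(logits, tgt[:, 1:], dlogits, torch.tensor([0, 0, 1, 1])))

    Note that at inference time only the translation path is used, so a model of this shape needs no domain label for test inputs, which matches the property highlighted in the conclusion.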
    Methods
    • For the Japanese translation task, the authors evaluate the domain mixing techniques on the standard ASPEC corpus (Nakazawa et al., 2016), consisting of 3M scientific-document sentence pairs, and the SubCrawl corpus, consisting of 3.2M colloquial sentence pairs harvested from freely available subtitle repositories on the World Wide Web. The authors use standard train/dev/test splits (3M, 1.8k, and 1.8k examples, respectively) and preprocess the data using subword units (Sennrich et al., 2015) to learn a shared English-Japanese vocabulary of size 32,000 (a preprocessing sketch follows this list).
    • For EN-FR, the authors use professional translations of European Parliament Proceedings (Koehn, 2005) and a 2016 dump of the OpenSubtitles database (Lison and Tiedemann, 2016).
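    A minimal sketch of learning and applying a shared 32,000-entry subword vocabulary over the mixed-domain training text is shown below. It uses the sentencepiece library as a convenient stand-in for the subword-nmt toolkit of Sennrich et al. (2015) that the authors cite; the file names are placeholders.

      import sentencepiece as spm

      # Learn one BPE model over the concatenated source and target training corpora
      # (mixed across domains) so both languages share a single 32k subword vocabulary.
      spm.SentencePieceTrainer.train(
          input="train.en,train.ja",      # placeholder corpus files
          model_prefix="shared_bpe",
          vocab_size=32000,
          model_type="bpe",
      )

      sp = spm.SentencePieceProcessor(model_file="shared_bpe.model")
      print(sp.encode("Neural machine translation handles mixed domains.", out_type=str))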
    Results
    • The results of the proxy A-distance experiment are given in Table 1. dA is a purely comparative metric with little meaning in isolation (Ben-David et al., 2007), but it is evident that the EN-JA and EN-ZH domain pairs are more disparate, while the EN-FR domains are more similar (a sketch of how dA can be estimated follows this list).
    • The authors' results support the hypothesis that the naive concatenation of data from disparate domains can degrade in-domain translation quality (Table 2).
    • In both the EN-JA and EN-FR settings, the domains undergoing mixing are disparate enough to degrade translation quality when mixed.
    • Figure (a): comparison of the mixed-domain and individual-domain baselines (BLEU_mixed − BLEU_individual) as domain distance varies.
    • The more different two domains are, the more their naive mixture degrades performance.
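    The proxy A-distance is estimated from the held-out error rate err of a classifier trained to tell the two domains apart, dA = 2 * (1 - 2 * err), so more separable (more disparate) domains yield larger dA. A small illustrative sketch follows; the bag-of-words features and logistic-regression probe are assumptions for the example, not necessarily the authors' exact setup.

      from sklearn.feature_extraction.text import CountVectorizer
      from sklearn.linear_model import LogisticRegression
      from sklearn.model_selection import train_test_split

      def proxy_a_distance(domain_a_sents, domain_b_sents):
          """Estimate dA = 2 * (1 - 2 * err) with a simple domain classifier (illustrative probe)."""
          texts = domain_a_sents + domain_b_sents
          labels = [0] * len(domain_a_sents) + [1] * len(domain_b_sents)
          x_tr, x_te, y_tr, y_te = train_test_split(texts, labels, test_size=0.2, random_state=0)

          vec = CountVectorizer(max_features=50000)
          clf = LogisticRegression(max_iter=1000)
          clf.fit(vec.fit_transform(x_tr), y_tr)

          err = 1.0 - clf.score(vec.transform(x_te), y_te)  # held-out error rate
          return 2.0 * (1.0 - 2.0 * err)

      # Toy example with two obviously different "domains"; real inputs would be
      # the source sides of the parliamentary and subtitle corpora.
      print(proxy_a_distance(["the committee approved the annual budget report"] * 100,
                             ["omg that movie was so funny lol"] * 100))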
    Conclusion
    • The authors presented three novel models for applying Neural Machine Translation to multi-domain settings, demonstrated their efficacy across six domains in three language pairs, and in the process achieved a new state-of-the-art in EN-JA translation.
    • Unlike naive combining of training data, these models improve translation quality on each constituent domain.
    • These models are the first of their kind that do not require knowledge of each example's domain at inference time.
    • All the proposed approaches outperform naive combining of training data, so the authors advise practitioners to implement whichever best fits their existing pipelines; the approach based on a discriminator network, however, offered the most reliable results.
    Tables
    • Table 1: Proxy A-distances (dA) for each domain pair.
    • Table 2: BLEU scores for models trained on various domains and languages (both mixed and unmixed). Rows correspond to training domains and columns correspond to test domains. Note that our single-domain ASPEC results are state-of-the-art, indicating the strength of these baselines (an example of computing one such BLEU cell follows this list).
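    Each cell of Table 2 is a corpus-level BLEU score for one (training domain, test domain) pair. Below is a minimal sketch of computing such a score with the sacrebleu library, used here as a modern stand-in for the paper's BLEU scorer; the hypothesis and reference strings are placeholders.

      import sacrebleu

      # One cell of Table 2: outputs of a model trained on domain X, scored on domain Y's test set.
      hypotheses = ["the committee approved the budget", "the cat sat on the mat"]
      references = [["the committee approved the annual budget", "the cat sat on the mat"]]

      bleu = sacrebleu.corpus_bleu(hypotheses, references)
      print(f"BLEU = {bleu.score:.2f}")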
    Related work
    • Our work builds on a recent literature on domain adaptation strategies in Neural Machine Translation. Prior work in this space has proposed two general categories of methods.

      The first proposed method is to take models trained on the source domain and finetune them on target-domain data. Luong and Manning (2015) and Zoph et al. (2016) explore how to improve transfer learning for a low-resource language pair by finetuning only parts of the network. Chu et al. (2017) empirically evaluate domain adaptation methods and propose mixing source and target domain data during finetuning. Freitag and Al-Onaizan (2016) explored finetuning using only a small subset of target domain data. Note that we did not compare directly against these techniques because they are intended to transfer knowledge to a new domain and perform well on only the target domain; we are concerned with multi-domain settings, where performance on all constituent domains is important (a sketch of the generic fine-tuning recipe follows this paragraph).
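    For contrast, the generic fine-tuning recipe described above amounts to continuing training on target-domain data from a source-domain checkpoint, usually with a reduced learning rate. The sketch below is a deliberately tiny, hypothetical illustration of that recipe; the model and data are placeholders, not an NMT system.

      import torch
      import torch.nn as nn

      # Placeholder "pretrained source-domain" model; in practice this would be a trained NMT system.
      model = nn.Sequential(nn.Embedding(1000, 64), nn.Flatten(1), nn.Linear(64 * 10, 1000))

      # Placeholder target-domain batch: 10-token inputs and one label per example.
      src = torch.randint(0, 1000, (32, 10))
      tgt = torch.randint(0, 1000, (32,))

      # Fine-tuning: the same training loop, but starting from pretrained weights and using
      # a small learning rate so source-domain knowledge is not overwritten too quickly.
      optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
      for step in range(3):
          optimizer.zero_grad()
          loss = nn.functional.cross_entropy(model(src), tgt)
          loss.backward()
          optimizer.step()
          print(f"fine-tune step {step}: loss = {loss.item():.3f}")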
    Contributions
    • Shows that training NMT systems on naively mixed data can degrade performance versus models fit to each constituent domain.
    • Demonstrates that this problem can be circumvented, and proposes three models that do so by jointly learning domain discrimination and translation.
    • Demonstrates the efficacy of these techniques by merging pairs of domains in three languages: Chinese, French, and Japanese.
    • Evaluates on pairs of linguistically disparate corpora in three translation tasks, and observes that, unlike naively training on mixed data, the proposed techniques consistently improve translation quality in each individual setting.
    References
    • Martín Abadi, Ashish Agarwal, Paul Barham, Eugene Brevdo, Zhifeng Chen, Craig Citro, Greg S Corrado, Andy Davis, Jeffrey Dean, Matthieu Devin, et al. 2016. TensorFlow: Large-scale machine learning on heterogeneous distributed systems. arXiv preprint arXiv:1603.04467.
    • Anonymous. 2017. SubCrawl: A colloquial parallel corpus for English-Japanese translation. Manuscript submitted for publication.
    • Amittai Axelrod, Xiaodong He, and Jianfeng Gao. 2011. Domain adaptation via pseudo in-domain data selection. In Proceedings of the Conference on Empirical Methods in Natural Language Processing. Association for Computational Linguistics, pages 355–362.
    • Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. 2014. Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473.
    • Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. 2015. Neural machine translation by jointly learning to align and translate. In ICLR.
    • Shai Ben-David, John Blitzer, Koby Crammer, Fernando Pereira, et al. 2007. Analysis of representations for domain adaptation. Advances in Neural Information Processing Systems 19:137.
    • D. Britz, A. Goldie, T. Luong, and Q. Le. 2017. Massive exploration of neural machine translation architectures. ArXiv e-prints.
    • Kyunghyun Cho, Bart van Merrienboer, Çaglar Gülçehre, Fethi Bougares, Holger Schwenk, and Yoshua Bengio. 2014. Learning phrase representations using RNN encoder-decoder for statistical machine translation. In EMNLP.
    • Chenhui Chu, Raj Dabre, and Sadao Kurohashi. 2017. An empirical comparison of simple domain adaptation methods for neural machine translation. CoRR abs/1701.03214. http://arxiv.org/abs/1701.03214.
    • Jonathan H Clark, Alon Lavie, and Chris Dyer. 2012. One system, many domains: Open-domain statistical machine translation via feature augmentation.
    • Jeffrey L Elman. 1990. Finding structure in time. Cognitive Science 14(2):179–211.
    • M Amin Farajian, Marco Turchi, Matteo Negri, Nicola Bertoldi, and Marcello Federico. 2017. Neural vs. phrase-based machine translation in a multi-domain scenario. EACL 2017, page 280.
    • Markus Freitag and Yaser Al-Onaizan. 2016. Fast domain adaptation for neural machine translation. CoRR abs/1612.06897. http://arxiv.org/abs/1612.06897.
    • Yaroslav Ganin, Evgeniya Ustinova, Hana Ajakan, Pascal Germain, Hugo Larochelle, François Laviolette, Mario Marchand, and Victor Lempitsky. 2015. Domain-adversarial training of neural networks. arXiv preprint arXiv:1505.07818.
    • Y. Ganin, E. Ustinova, H. Ajakan, P. Germain, H. Larochelle, F. Laviolette, M. Marchand, and V. Lempitsky. 2016. Domain-adversarial training of neural networks. ArXiv e-prints.
    • Xavier Glorot, Antoine Bordes, and Yoshua Bengio. 2011. Domain adaptation for large-scale sentiment classification: A deep learning approach. In Proceedings of the 28th International Conference on Machine Learning (ICML-11), pages 513–520.
    • Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. 2014. Generative adversarial nets. In Advances in Neural Information Processing Systems, pages 2672–2680.
    • Barry Haddow and Philipp Koehn. 2012. Analysing the effect of out-of-domain data on SMT systems. In Proceedings of the Seventh Workshop on Statistical Machine Translation. Association for Computational Linguistics, pages 422–432.
    • Motoko Hori. 1986. A sociolinguistic analysis of the Japanese honorifics. Journal of Pragmatics 10(3):373–386.
    • Melvin Johnson, Mike Schuster, Quoc V. Le, Maxim Krikun, Yonghui Wu, Zhifeng Chen, Nikhil Thorat, Fernanda B. Viégas, Martin Wattenberg, Greg Corrado, Macduff Hughes, and Jeffrey Dean. 2016. Google's multilingual neural machine translation system: Enabling zero-shot translation. CoRR abs/1611.04558. http://arxiv.org/abs/1611.04558.
    • Nal Kalchbrenner and Phil Blunsom. 2013. Recurrent continuous translation models. In EMNLP.
    • Diederik Kingma and Jimmy Ba. 2014. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980.
    • Catherine Kobus, Josep Crego, and Jean Senellart. 2016. Domain control for neural machine translation. arXiv preprint arXiv:1612.06140.
    • Philipp Koehn. 2005. Europarl: A parallel corpus for statistical machine translation. In MT Summit, volume 5, pages 79–86.
    • Philipp Koehn, Hieu Hoang, Alexandra Birch, Chris Callison-Burch, Marcello Federico, Nicola Bertoldi, Brooke Cowan, Wade Shen, Christine Moran, Richard Zens, et al. 2007. Moses: Open source toolkit for statistical machine translation. In Proceedings of the 45th Annual Meeting of the ACL on Interactive Poster and Demonstration Sessions. Association for Computational Linguistics, pages 177–180.
    • Pierre Lison and Jörg Tiedemann. 2016. OpenSubtitles2016: Extracting large parallel corpora from movie and TV subtitles. In Proceedings of the 10th International Conference on Language Resources and Evaluation.
    • Mingsheng Long, Yue Cao, Jianmin Wang, and Michael I Jordan. 2015. Learning transferable features with deep adaptation networks. In ICML, pages 97–105.
    • Minh-Thang Luong and Christopher D Manning. 2015. Stanford neural machine translation systems for spoken language domains. In Proceedings of the International Workshop on Spoken Language Translation.
    • Minh-Thang Luong, Hieu Pham, and Christopher D. Manning. 2015a. Effective approaches to attention-based neural machine translation. In EMNLP.
    • Minh-Thang Luong, Hieu Pham, and Christopher D Manning. 2015b. Effective approaches to attention-based neural machine translation. arXiv preprint arXiv:1508.04025.
    • Arindam Mandal, Dimitra Vergyri, Wen Wang, Jing Zheng, Andreas Stolcke, Gokhan Tur, D Hakkani-Tur, and Necip Fazil Ayan. 2008. Efficient data selection for machine translation. In Spoken Language Technology Workshop (SLT 2008). IEEE, pages 261–264.
    • Toshiaki Nakazawa, Manabu Yaguchi, Kiyotaka Uchimoto, Masao Utiyama, Eiichiro Sumita, Sadao Kurohashi, and Hitoshi Isahara. 2016. ASPEC: Asian scientific paper excerpt corpus. In Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC 2016), pages 2204–2208.
    • Graham Neubig, Yosuke Nakata, and Shinsuke Mori. 2011. Pointwise prediction for robust, adaptable Japanese morphological analysis. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies: Short Papers - Volume 2. Association for Computational Linguistics, pages 529–533.
    • Pavel Pecina, Antonio Toral, and Josef Van Genabith. 2012. Simple and effective parameter tuning for domain adaptation of statistical machine translation. In COLING, pages 2209–2224.
    • Rico Sennrich, Barry Haddow, and Alexandra Birch. 2015. Neural machine translation of rare words with subword units. arXiv preprint arXiv:1508.07909.
    • Ilya Sutskever, Oriol Vinyals, and Quoc V. Le. 2014. Sequence to sequence learning with neural networks. In NIPS.
    • Jörg Tiedemann. 2012. Parallel data, tools and interfaces in OPUS. In LREC, volume 2012, pages 2214–2218.
    • Yonghui Wu, Mike Schuster, Zhifeng Chen, Quoc V. Le, Mohammad Norouzi, Wolfgang Macherey, Maxim Krikun, Yuan Cao, Qin Gao, Klaus Macherey, Jeff Klingner, Apurva Shah, Melvin Johnson, Xiaobing Liu, Lukasz Kaiser, Stephan Gouws, Yoshikiyo Kato, Taku Kudo, Hideto Kazawa, Keith Stevens, George Kurian, Nishant Patil, Wei Wang, Cliff Young, Jason Smith, Jason Riesa, Alex Rudnick, Oriol Vinyals, Greg Corrado, Macduff Hughes, and Jeffrey Dean. 2016. Google's neural machine translation system: Bridging the gap between human and machine translation. CoRR abs/1609.08144.
    • Barret Zoph, Deniz Yuret, Jonathan May, and Kevin Knight. 2016. Transfer learning for low-resource neural machine translation. http://arxiv.org/abs/1604.02201.