Semi-Supervised Learning for Neural Machine Translation

Annual Meeting of the Association for Computational Linguistics (ACL), 2016.

Keywords:
NIST, parameter estimation, mathematics, recurrent neural networks, central idea

Abstract:

While end-to-end neural machine translation (NMT) has made remarkable progress recently, NMT systems rely only on parallel corpora for parameter estimation. Since parallel corpora are usually limited in quantity, quality, and coverage, especially for low-resource languages, it is appealing to exploit monolingual corpora to improve NMT.

Introduction
  • End-to-end neural machine translation (NMT), which leverages a single, large neural network to directly transform a source-language sentence into a target-language sentence, has attracted increasing attention in recent years (Kalchbrenner and Blunsom, 2013; Sutskever et al., 2014; Bahdanau et al., 2015).
  • Most existing NMT approaches suffer from a major drawback: they rely heavily on parallel corpora for training translation models.
  • This is because NMT directly models the probability of a target-language sentence given a source-language sentence and, unlike SMT, does not have a separate language model (Kalchbrenner and Blunsom, 2013; Sutskever et al., 2014; Bahdanau et al., 2015); the standard training objective is sketched after this list.
  • The unavailability of large-scale, high-quality, and wide-coverage parallel corpora hinders the applicability of NMT
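    As background for the points above, standard NMT training maximizes the conditional log-likelihood of target sentences given source sentences on a parallel corpus only. A minimal LaTeX sketch of this objective (the symbols x, y, \theta and D_p are our notation, not taken verbatim from the paper):

        \[
        \hat{\theta}_{x \to y} = \arg\max_{\theta_{x \to y}} \sum_{(x, y) \in D_p} \log P(y \mid x; \theta_{x \to y})
        \]

    Because this objective only involves sentence pairs, monolingual data cannot contribute to it directly, which is the gap the semi-supervised approach targets.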
Highlights
  • End-to-end neural machine translation (NMT), which leverages a single, large neural network to directly transform a source-language sentence into a target-language sentence, has attracted increasing attention in recent years (Kalchbrenner and Blunsom, 2013; Sutskever et al., 2014; Bahdanau et al., 2015)
  • Free of the latent structure design and feature engineering that are critical in conventional statistical machine translation (SMT) (Brown et al., 1993; Koehn et al., 2003; Chiang, 2005), neural machine translation has proven to excel at modeling the translation process
  • Most existing neural machine translation approaches suffer from a major drawback: they rely heavily on parallel corpora for training translation models
  • We have presented a semi-supervised approach to training bidirectional neural machine translation models (see the objective sketch after this list)
  • Experiments on Chinese-English NIST datasets show that our approach leads to significant improvements
  • As our method is sensitive to the OOVs present in monolingual corpora, we plan to integrate Jean et al. (2015)'s technique for using a very large vocabulary into our approach
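    The bidirectional, semi-supervised training highlighted above can be summarized as a joint objective that adds autoencoder-style reconstruction terms on monolingual data to the usual translation likelihoods on parallel data. The following LaTeX sketch uses our own notation (\theta_{x \to y} and \theta_{y \to x} for the two translation models, D_p for the parallel corpus, D_s and D_t for the source and target monolingual corpora, \lambda_1 and \lambda_2 for interpolation weights), not the paper's exact formulation:

        \[
        \begin{aligned}
        J ={} & \sum_{(x,y) \in D_p} \bigl[ \log P(y \mid x; \theta_{x \to y}) + \log P(x \mid y; \theta_{y \to x}) \bigr] \\
              & + \lambda_1 \sum_{y \in D_t} \log P(y \mid y; \theta_{y \to x}, \theta_{x \to y})
                + \lambda_2 \sum_{x \in D_s} \log P(x \mid x; \theta_{x \to y}, \theta_{y \to x})
        \end{aligned}
        \]

    A reconstruction term such as P(y | y) marginalizes over latent translations, e.g. \sum_{x'} P(y \mid x'; \theta_{x \to y}) P(x' \mid y; \theta_{y \to x}); in practice the sum is approximated with a small set of sampled or k-best translations, so one translation direction plays the role of the encoder and the other the decoder.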
Methods
  • Adding a Chinese monolingual corpus benefits English-to-Chinese translation more than adding an English monolingual corpus (a training-step sketch follows this list).
  • Adding a target-side monolingual corpus improves over using only the parallel corpus for source-to-target translation.
  • Adding a source-side monolingual corpus also improves over using only the parallel corpus for source-to-target translation, but the gains are smaller than with a target-side monolingual corpus.
  • Adding both source and target monolingual corpora does not lead to further significant improvements.
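    To make the bidirectional training procedure behind these findings concrete, below is a minimal, hypothetical Python sketch of one semi-supervised step that combines a parallel batch with a monolingual target-side batch reconstructed through sampled round-trip translations. The interfaces (ToyTranslationModel, sample_translations, log_prob, update) are our own stand-ins with placeholder scores, not the authors' implementation.

        class ToyTranslationModel:
            """Stand-in for one NMT direction (source-to-target or target-to-source)."""

            def __init__(self, name):
                self.name = name

            def sample_translations(self, sentence, k=2):
                # A real model would decode k candidate translations; here we fake them.
                return [f"<{self.name}-hyp{i}> {sentence}" for i in range(k)]

            def log_prob(self, target, source):
                # Placeholder for log P(target | source) under the model.
                return -0.01 * len(target)

            def update(self, loss):
                # Placeholder for a gradient step on the combined loss.
                pass


        def semi_supervised_step(model_st, model_ts, parallel_batch, mono_target_batch, lam=0.1):
            """One step mixing parallel likelihood with reconstruction of monolingual target sentences."""
            loss = 0.0
            # Supervised term: translation likelihood in both directions on the parallel batch.
            for src, tgt in parallel_batch:
                loss -= model_st.log_prob(tgt, src) + model_ts.log_prob(src, tgt)
            # Unsupervised term: target -> sampled sources (encoder) -> reconstructed target (decoder).
            for tgt in mono_target_batch:
                sources = model_ts.sample_translations(tgt, k=2)
                recon = max(model_st.log_prob(tgt, s) for s in sources)
                loss -= lam * recon
            model_st.update(loss)
            model_ts.update(loss)
            return loss

    For example, semi_supervised_step(ToyTranslationModel("c2e"), ToyTranslationModel("e2c"), [("zh sent", "en sent")], ["en-only sent"]) runs one toy step; a symmetric term over source-side monolingual sentences would be added analogously.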
Results
  • The BLEU scores are case-insensitive. “*”: significantly better than MOSES (p < 0.05); “**”: significantly better than MOSES (p < 0.01); “+”: significantly better than RNNSEARCH (p < 0.05); “++”: significantly better than RNNSEARCH (p < 0.01). A generic significance-testing sketch follows below.
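    The comparisons above are reported with statistical significance tests; this summary does not say which test was used, but paired bootstrap resampling over per-sentence quality scores is one common way to obtain such p-values. A hedged, self-contained sketch (the function name and the example score lists are illustrative assumptions, not the authors' evaluation code):

        import random

        def paired_bootstrap_pvalue(scores_a, scores_b, n_resamples=1000, seed=0):
            """Approximate p-value for 'system A is not better than system B',
            given per-sentence quality scores on the same test set."""
            assert len(scores_a) == len(scores_b)
            rng = random.Random(seed)
            n = len(scores_a)
            wins = 0
            for _ in range(n_resamples):
                idx = [rng.randrange(n) for _ in range(n)]
                if sum(scores_a[i] for i in idx) > sum(scores_b[i] for i in idx):
                    wins += 1
            return 1.0 - wins / n_resamples

        # Example: p = paired_bootstrap_pvalue([0.32, 0.41, 0.28], [0.30, 0.35, 0.29])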
Conclusion
  • The authors have presented a semi-supervised approach to training bidirectional neural machine translation models.
  • As the method is sensitive to the OOVs present in monolingual corpora, the authors plan to integrate Jean et al. (2015)'s technique for using a very large vocabulary into the approach.
  • It is necessary to further validate the effectiveness of the approach on more language pairs and NMT architectures.
  • Another interesting direction is to enhance the connection between source-to-target and target-to-source models to help them benefit more from interacting with each other.
Tables
  • Table 1: Characteristics of parallel and monolingual corpora
  • Table 2: Comparison with MOSES and RNNSEARCH. MOSES is a phrase-based statistical machine translation system (Koehn et al., 2007). RNNSEARCH is an attention-based neural machine translation system (Bahdanau et al., 2015).
  • Table 3: Comparison with Sennrich et al. (2015). Both Sennrich et al. (2015) and our approach build on top of RNNSEARCH to exploit monolingual corpora. The BLEU scores are case-insensitive. “*”: significantly better than Sennrich et al. (2015) (p < 0.05); “**”: significantly better than Sennrich et al. (2015) (p < 0.01)
  • Table 4: Example translations of sentences in the monolingual corpus during semi-supervised learning. We find our approach is capable of generating better translations of the monolingual corpus over time
Related work
  • Our work is inspired by two lines of research: (1) exploiting monolingual corpora for machine translation and (2) autoencoders in unsupervised and semi-supervised learning.

    4.1 Exploiting Monolingual Corpora for Machine Translation

    Exploiting monolingual corpora for conventional SMT has attracted intensive attention in recent years. Several authors have introduced transductive learning to make full use of monolingual corpora (Ueffing et al., 2007; Bertoldi and Federico, 2009). They use an existing translation model to translate unseen source text, which can be paired with its translations to form a pseudo-parallel corpus; this process iterates until convergence (a minimal self-training sketch follows below). While Klementiev et al. (2012) propose an approach to estimating phrase translation probabilities from monolingual corpora, Zhang and Zong (2013) directly extract parallel phrases from monolingual corpora using retrieval techniques. Another important line of research is to treat translation on monolingual corpora as a decipherment problem (Ravi and Knight, 2011; Dou et al., 2014).
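    As a concrete illustration of the transductive learning idea described above, here is a minimal, hypothetical Python sketch of the translate-then-retrain loop: the current model translates unseen monolingual source text, the resulting pseudo-parallel pairs are added to the training data, and the model is retrained. ToySMT and its methods are placeholder stand-ins, not the cited authors' systems.

        class ToySMT:
            """Stand-in for a translation system trained on a (pseudo-)parallel corpus."""

            def __init__(self, corpus):
                self.corpus_size = len(corpus)

            def translate(self, src):
                # A real system would decode here; we just return a marked placeholder.
                return f"<translation of: {src}>"


        def transductive_self_training(parallel_corpus, mono_source, rounds=3):
            """Iteratively grow the training data with pseudo-parallel pairs and retrain."""
            model = ToySMT(parallel_corpus)
            for _ in range(rounds):
                # Translate unseen monolingual source sentences with the current model ...
                pseudo = [(src, model.translate(src)) for src in mono_source]
                # ... and retrain on the original parallel data plus the pseudo-parallel pairs.
                model = ToySMT(list(parallel_corpus) + pseudo)
            return model

    In practice the loop would stop once the translations (or held-out scores) stop changing, which corresponds to the "iterates until convergence" behavior described above.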
Funding
  • This research is supported by the 973 Program (2014CB340501, 2014CB340505), the National Natural Science Foundation of China (Nos. 61522204, 61331013, and 61361136003), a 1000 Talent Plan grant, Tsinghua Initiative Research Program grant 20151080475, and a Google Faculty Research Award.
References
  • Waleed Ammar, Chris Dyer, and Noah Smith. 2014. Conditional random field autoencoders for unsupervised structured prediction. In Proceedings of NIPS.
  • Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. 2015. Neural machine translation by jointly learning to align and translate. In Proceedings of ICLR.
  • Nicola Bertoldi and Marcello Federico. 2009. Domain adaptation for statistical machine translation. In Proceedings of WMT.
  • Peter F. Brown, Stephen A. Della Pietra, Vincent J. Della Pietra, and Robert L. Mercer. 1993. The mathematics of statistical machine translation: Parameter estimation. Computational Linguistics.
  • David Chiang. 2005. A hierarchical phrase-based model for statistical machine translation. In Proceedings of ACL.
  • Kyunghyun Cho, Bart van Merrienboer, Dzmitry Bahdanau, and Yoshua Bengio. 2014. On the properties of neural machine translation: Encoder-decoder approaches. In Proceedings of SSST-8.
  • Andrew M. Dai and Quoc V. Le. 2015. Semi-supervised sequence learning. In Proceedings of NIPS.
  • Qing Dou, Ashish Vaswani, and Kevin Knight. 2014. Beyond parallel data: Joint word alignment and decipherment improves machine translation. In Proceedings of EMNLP.
  • Caglar Gulcehre, Orhan Firat, Kelvin Xu, Kyunghyun Cho, Loic Barrault, Huei-Chi Lin, Fethi Bougares, Holger Schwenk, and Yoshua Bengio. 2015. On using monolingual corpora in neural machine translation. arXiv:1503.03535 [cs.CL].
  • Sepp Hochreiter and Jürgen Schmidhuber. 1997. Long short-term memory. Neural Computation.
  • Sébastien Jean, Kyunghyun Cho, Roland Memisevic, and Yoshua Bengio. 2015. On using very large target vocabulary for neural machine translation. In Proceedings of ACL.
  • Nal Kalchbrenner and Phil Blunsom. 2013. Recurrent continuous translation models. In Proceedings of EMNLP.
  • Diederik P. Kingma, Shakir Mohamed, Danilo Jimenez Rezende, and Max Welling. 2014. Semi-supervised learning with deep generative models. In Advances in Neural Information Processing Systems.
  • Alexandre Klementiev, Ann Irvine, Chris Callison-Burch, and David Yarowsky. 2012. Toward statistical machine translation without parallel corpora. In Proceedings of EACL.
  • Philipp Koehn, Franz J. Och, and Daniel Marcu. 2003. Statistical phrase-based translation. In Proceedings of NAACL.
  • Philipp Koehn, Hieu Hoang, Alexandra Birch, Chris Callison-Burch, Marcello Federico, Nicola Bertoldi, Brooke Cowan, Wade Shen, Christine Moran, Richard Zens, Chris Dyer, Ondrej Bojar, Alexandra Constantin, and Evan Herbst. 2007. Moses: Open source toolkit for statistical machine translation. In Proceedings of ACL (demo session).
  • Minh-Thang Luong, Ilya Sutskever, Quoc V. Le, Oriol Vinyals, and Wojciech Zaremba. 2015. Addressing the rare word problem in neural machine translation. In Proceedings of ACL.
  • Franz Och. 2003. Minimum error rate training in statistical machine translation. In Proceedings of ACL.
  • Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. 2002. BLEU: a method for automatic evaluation of machine translation. In Proceedings of ACL.
  • Sujith Ravi and Kevin Knight. 2011. Deciphering foreign language. In Proceedings of ACL.
  • Rico Sennrich, Barry Haddow, and Alexandra Birch. 2015. Improving neural machine translation models with monolingual data. arXiv:1511.06709 [cs.CL].
  • Richard Socher, Jeffrey Pennington, Eric Huang, Andrew Ng, and Christopher Manning. 2011. Semi-supervised recursive autoencoders for predicting sentiment distributions. In Proceedings of EMNLP.
  • Andreas Stolcke. 2002. SRILM - an extensible language modeling toolkit. In Proceedings of ICSLP.
  • Ilya Sutskever, Oriol Vinyals, and Quoc V. Le. 2014. Sequence to sequence learning with neural networks. In Proceedings of NIPS.
  • Nicola Ueffing, Gholamreza Haffari, and Anoop Sarkar. 2007. Transductive learning for statistical machine translation. In Proceedings of ACL.
  • Pascal Vincent, Hugo Larochelle, Isabelle Lajoie, Yoshua Bengio, and Pierre-Antoine Manzagol. 2010. Stacked denoising autoencoders: Learning useful representations in a deep network with a local denoising criterion. Journal of Machine Learning Research.
  • Jiajun Zhang and Chengqing Zong. 2013. Learning a phrase-based translation model from monolingual data with application to domain adaptation. In Proceedings of ACL.