Agreement-based Joint Training for Bidirectional Attention-based Neural Machine Translation

IJCAI, 2016.

Keywords:
square of subtraction, independent training, square of addition, word alignment, alignment error rate

Abstract:

The attentional mechanism has proven to be effective in improving end-to-end neural machine translation. However, due to the intricate structural divergence between natural languages, unidirectional attention-based models might only capture partial aspects of attentional regularities. We propose agreement-based joint training for bidirectional attention-based neural machine translation.

Introduction
  • End-to-end neural machine translation (NMT) is a newly proposed paradigm for machine translation [Kalchbrenner and Blunsom, 2013; Cho et al., 2014; Sutskever et al., 2014; Bahdanau et al., 2015].
  • While early NMT models encode a source sentence as a fixed-length vector, Bahdanau et al. [2015] advocate the use of attention in NMT.
  • They indicate that only parts of the source sentence have an effect on the target word being generated.
  • The relevant parts often vary with different target words.
  • Such an attentional mechanism has proven to be an effective technique in text generation tasks such as machine translation [Bahdanau et al., 2015] (a minimal sketch of attention follows this list).
  • The encoder-decoder framework [Kalchbrenner and Blunsom, 2013; Cho et al., 2014; Sutskever et al., 2014; Bahdanau et al., 2015] usually uses a recurrent neural network (RNN).
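The attention mechanism referenced above can be illustrated with a minimal sketch of a single decoding step, following the additive scoring form of Bahdanau et al. [2015]. All variable names and matrix shapes below are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def attention_step(decoder_state, encoder_states, W_a, U_a, v_a):
    """One decoding step of additive attention (illustrative shapes:
    decoder_state (d,), encoder_states (n, d), W_a and U_a (d, d), v_a (d,))."""
    # Score every source position against the current decoder state.
    scores = np.array([v_a @ np.tanh(W_a @ decoder_state + U_a @ h_j)
                       for h_j in encoder_states])
    weights = softmax(scores)            # one row of the alignment (attention) matrix
    context = weights @ encoder_states   # weighted sum of source annotations
    return context, weights
```

The rows of attention weights collected over all target positions form the alignment matrix that the agreement-based training below operates on.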
Highlights
  • End-to-end neural machine translation (NMT) is a newly proposed paradigm for machine translation [Kalchbrenner and Blunsom, 2013; Cho et al., 2014; Sutskever et al., 2014; Bahdanau et al., 2015]
  • Without explicitly modeling latent structures that are vital for conventional statistical machine translation (SMT) [Brown et al., 1993; Koehn et al., 2003; Chiang, 2005], neural machine translation builds on an encoder-decoder framework: the encoder transforms a source-language sentence into a continuous-space representation, from which the decoder generates a target-language sentence
  • We propose to introduce agreement-based learning [Liang et al., 2006; 2007] into attention-based neural machine translation
  • We find that RNNSEARCH generally outperforms MOSES except for the C→E direction on the NIST08 test set, which confirms the effectiveness of attention-based neural machine translation on distantly-related language pairs such as Chinese and English
  • We have presented agreement-based joint training for bidirectional attention-based neural machine translation
  • By encouraging bidirectional models to agree on parametrized alignment matrices, joint learning achieves significant improvements in terms of alignment and translation quality over independent training
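A minimal sketch of what agreement on parametrized alignment matrices might look like for one sentence pair: the two directional models contribute their likelihood terms, and a disagreement penalty couples their attention matrices. The squared-difference form below corresponds to the "square of subtraction" loss named in the keywords and Table 1; the function signature, the transpose convention, and the weight lam are assumptions for illustration, not the paper's code.

```python
import numpy as np

def joint_loss(nll_s2t, nll_t2s, A_s2t, A_t2s, lam=1.0):
    """Agreement-based joint objective for one sentence pair (sketch).

    nll_s2t, nll_t2s: negative log-likelihoods of the source-to-target and
        target-to-source models on the same pair (x, y).
    A_s2t: attention matrix of the source-to-target model, shape (|y|, |x|).
    A_t2s: attention matrix of the target-to-source model, shape (|x|, |y|).
    lam:   weight of the disagreement term (assumed hyperparameter).
    """
    # "Square of subtraction": squared difference between the two alignment
    # matrices, with A_t2s transposed so both are (|y|, |x|).
    disagreement = np.sum((A_s2t - A_t2s.T) ** 2)
    return nll_s2t + nll_t2s + lam * disagreement
```

Minimizing this combined loss trains both directions jointly, so each model is pushed toward alignments the other model can reproduce.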
Methods
  • For Chinese-English, the training corpus from LDC consists of 2.56M sentence pairs with 67.53M Chinese words and 74.81M English words.
  • The NIST 2002, 2003, 2004, 2005, and 2008 datasets were used as test sets.
  • In the NIST Chinese-English datasets, each Chinese sentence has four reference English translations.
  • To build English-Chinese validation and test sets, we “reverse” the Chinese-English datasets: the first English sentence of the four references is used as the source sentence and the Chinese sentence as the single reference translation (as sketched below)
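A minimal sketch of the dataset "reversal" described above, assuming each Chinese-to-English evaluation item pairs one Chinese sentence with four English references; the data structure is illustrative, not the paper's preprocessing code.

```python
def reverse_c2e_item(chinese_sentence, english_references):
    """Turn a Chinese-to-English evaluation item (1 source, 4 references)
    into an English-to-Chinese item (1 source, 1 reference)."""
    # The first English reference becomes the source; the original Chinese
    # sentence becomes the single reference translation.
    return {
        "source": english_references[0],
        "references": [chinese_sentence],
    }
```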
Results
  • Results on Chinese-English Translation

    Table 2 shows the results on the Chinese-to-English (C→E) and English-to-Chinese (E→C) translation tasks. The authors find that RNNSEARCH generally outperforms MOSES except for the C→E direction on the NIST08 test set, which confirms the effectiveness of attention-based NMT on distantly-related language pairs such as Chinese and English.

    Agreement-based joint training further systematically improves the translation quality in both directions over independent training, except for the E→C direction on the NIST04 test set.
  • The authors find that agreement-based joint training significantly reduces alignment errors (measured by alignment error rate; see the sketch after this list) for both directions as compared with independent training.
  • This suggests that introducing agreement enables NMT to capture attention more accurately and leads to better translations.
  • While RNNSEARCH with independent training achieves translation performance on par with MOSES, agreement-based joint learning leads to significant improvements over both baselines.
  • This suggests that the approach is general and can be applied to more language pairs
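The alignment results referenced above are reported as alignment error rate (AER, Table 3). The sketch below uses the standard AER definition over sure and possible gold links; whether the paper's evaluation follows exactly this formulation is an assumption.

```python
def alignment_error_rate(predicted, sure, possible):
    """Standard AER; lower is better. All arguments are sets of
    (source_index, target_index) links; sure is assumed to be a subset
    of possible."""
    a = set(predicted)
    s, p = set(sure), set(possible)
    matches = len(a & s) + len(a & p)
    return 1.0 - matches / (len(a) + len(s))
```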
Conclusion
  • The authors have presented agreement-based joint training for bidirectional attention-based neural machine translation.
  • The authors plan to further validate the effectiveness of the approach on more language pairs
Tables
  • Table 1: Comparison of loss functions in terms of case-insensitive BLEU scores on the validation set for Chinese-to-English translation
  • Table 2: Results on the Chinese-English translation task. MOSES is a phrase-based statistical machine translation system
  • Table 3: Results on the Chinese-English word alignment task. The evaluation metric is alignment error rate. “**”: significantly better than RNNSEARCH with independent training (p < 0.01)
  • Table 4: Comparison of independent and joint training in terms of average attention entropy (see Eq. (15)) on Chinese-to-English translation
  • Table 5: Results on the English-French translation task. The BLEU scores are case-insensitive. “**”: significantly better than MOSES (p < 0.01); “++”: significantly better than RNNSEARCH with independent training (p < 0.01)
Related work
  • Our work is inspired by two lines of research: (1) attention-based NMT and (2) agreement-based learning.

    5.1 Attention-based Neural Machine Translation

    Bahdanau et al. [2015] first introduce the attentional mechanism into neural machine translation to enable the decoder to focus on relevant parts of the source sentence during decoding. The attention mechanism allows a neural model to cope better with long sentences because it does not need to encode all the information of a source sentence into a fixed-length vector regardless of its length. In addition, the attentional mechanism allows us to look into the “black box” to gain insights on how NMT works from a linguistic perspective.

    Luong et al. [2015a] propose two simple and effective attentional mechanisms for neural machine translation and compare various alignment functions. They show that attention-based NMT is superior to non-attentional models in translating names and long sentences.

    After analyzing the alignment matrices generated by RNNSEARCH [Bahdanau et al., 2015], we find that modeling the structural divergence of natural languages is so challenging that unidirectional models can only capture part of the alignment regularities. This finding inspires us to improve attention-based NMT by combining two unidirectional models. In this work, we only apply agreement-based joint learning to RNNSEARCH. As our approach does not assume specific network architectures, it is possible to apply it to the models proposed by Luong et al. [2015a].
Funding
  • This research is supported by the 973 Program (2014CB340501, 2014CB340505), the National Natural Science Foundation of China (No. 61522204, 61331013, 61361136003), a 1000 Talent Plan grant, Tsinghua Initiative Research Program grant 20151080475, and a Google Faculty Research Award
References
  • [Bahdanau et al., 2015] Dzmitry Bahdanau, KyungHyun Cho, and Yoshua Bengio. Neural machine translation by jointly learning to align and translate. In Proceedings of ICLR, 2015.
  • [Brown et al., 1993] Peter F. Brown, Stephen A. Della Pietra, Vincent J. Della Pietra, and Robert L. Mercer. The mathematics of statistical machine translation: Parameter estimation. Computational Linguistics, 1993.
  • [Chiang, 2005] David Chiang. A hierarchical phrase-based model for statistical machine translation. In Proceedings of ACL, 2005.
  • [Cho et al., 2014] Kyunghyun Cho, Bart van Merrienboer, Caglar Gulcehre, Dzmitry Bahdanau, Fethi Bougares, Holger Schwenk, and Yoshua Bengio. Learning phrase representations using RNN encoder-decoder for statistical machine translation. In Proceedings of EMNLP, 2014.
  • [Ganchev et al., 2010] Kuzman Ganchev, Joao Graca, Jennifer Gillenwater, and Ben Taskar. Posterior regularization for structured latent variable models. The Journal of Machine Learning Research, 11:2001–2049, 2010.
  • [Jean et al., 2015] Sebastien Jean, Kyunghyun Cho, Roland Memisevic, and Yoshua Bengio. On using very large target vocabulary for neural machine translation. In Proceedings of ACL, 2015.
  • [Kalchbrenner and Blunsom, 2013] Nal Kalchbrenner and Phil Blunsom. Recurrent continuous translation models. In Proceedings of EMNLP, 2013.
  • [Koehn and Hoang, 2007] Philipp Koehn and Hieu Hoang. Factored translation models. In Proceedings of EMNLP, 2007.
  • [Koehn et al., 2003] Philipp Koehn, Franz J. Och, and Daniel Marcu. Statistical phrase-based translation. In Proceedings of HLT-NAACL, 2003.
  • [Koehn, 2004] Philipp Koehn. Statistical significance tests for machine translation evaluation. In Proceedings of EMNLP, 2004.
  • [Levinboim et al., 2015] Tomer Levinboim, Ashish Vaswani, and David Chiang. Model invertibility regularization: Sequence alignment with or without parallel data. In Proceedings of NAACL, 2015.
  • [Liang et al., 2006] Percy Liang, Ben Taskar, and Dan Klein. Alignment by agreement. In Proceedings of NAACL, 2006.
  • [Liang et al., 2007] Percy Liang, Dan Klein, and Michael I. Jordan. Agreement-based learning. In Proceedings of NIPS, 2007.
  • [Liu and Sun, 2015] Yang Liu and Maosong Sun. Contrastive unsupervised word alignment with non-local features. In Proceedings of AAAI, 2015.
  • [Liu et al., 2015] Chunyang Liu, Yang Liu, Huanbo Luan, Maosong Sun, and Heng Yu. Generalized agreement for bidirectional word alignment. In Proceedings of EMNLP, 2015.
  • [Luong et al., 2015a] Minh-Thang Luong, Hieu Pham, and Christopher D. Manning. Effective approaches to attention-based neural machine translation. In Proceedings of EMNLP, 2015.
  • [Luong et al., 2015b] Minh-Thang Luong, Ilya Sutskever, Quoc V. Le, Oriol Vinyals, and Wojciech Zaremba. Addressing the rare word problem in neural machine translation. In Proceedings of ACL, 2015.
  • [Stolcke, 2002] Andreas Stolcke. SRILM - an extensible language modeling toolkit. In Proceedings of ICSLP, 2002.
  • [Sutskever et al., 2014] Ilya Sutskever, Oriol Vinyals, and Quoc V. Le. Sequence to sequence learning with neural networks. In Proceedings of NIPS, 2014.
  • [Xu et al., 2015] Kelvin Xu, Jimmy Lei Ba, Ryan Kiros, KyungHyun Cho, Aaron Courville, Ruslan Salakhutdinov, Richard S. Zemel, and Yoshua Bengio. Show, attend and tell: Neural image caption generation with visual attention. In Proceedings of ICML, 2015.