Optimizing Non-Decomposable Evaluation Metrics for Neural Machine Translation

J. Comput. Sci. Technol., Volume 32, Issue 4, 2017, Pages 796-804.

Keywords:
neural machine translation; training criterion; non-decomposable evaluation metric

Abstract:

While optimizing model parameters with respect to evaluation metrics has recently proven to benefit end-to-end neural machine translation (NMT), the evaluation metrics used in training are restricted to be defined at the sentence level to facilitate online learning algorithms. This is undesirable because the final evaluation metrics u…

Introduction
  • The past several years have witnessed the rapid development of end-to-end neural machine translation (NMT)[1,2,3].
  • Recent work has introduced minimum risk training (MRT)[9] into NMT to optimize model parameters with respect to evaluation metrics such as BLEU[14] and TER[15].
  • Their experiments show that optimizing NMT models with respect to evaluation metrics leads to significant improvements over maximum likelihood estimation.
Highlights
  • The past several years have witnessed the rapid development of end-to-end neural machine translation (NMT)[1,2,3]
  • Recent work has introduced minimum risk training (MRT)[9] into NMT to optimize model parameters with respect to evaluation metrics such as BLEU[14] and TER[15]. Their experiments show that optimizing NMT models with respect to evaluation metrics leads to significant improvements over maximum likelihood estimation
  • We propose an approach to training neural machine translation (NMT) models with non-decomposable evaluation metrics, optimizing model parameters directly with respect to such metrics (see the formulas after this list)
  • Experiments on Chinese-English and English-French translation show that our approach improves the correlation between training and testing and significantly outperforms the minimum risk training (MRT) algorithm using decomposable evaluation metrics
  • As our approach is transparent to network architectures and evaluation metrics, it can potentially benefit more natural language processing tasks
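
To make the contrast concrete, the two training objectives can be written as expected risks. The following is a hedged sketch in my own notation (Δ is an error metric such as 1 − BLEU, D the training data, S a sampled subset of D), not a verbatim formula from the paper:

```latex
% Sentence-level MRT (sMRT): the risk decomposes over sentences,
% so each training example can be handled independently online.
\begin{align}
R_{\mathrm{s}}(\theta) &= \sum_{(x,\,y^{*}) \in D}
  \mathbb{E}_{y \sim P(\cdot \mid x;\,\theta)}\big[\Delta(y,\,y^{*})\big] \\
% Corpus-level variant (cMRT): the metric is evaluated jointly on the
% translations of a whole subset S, so the risk does not decompose.
R_{\mathrm{c}}(\theta) &= \mathbb{E}_{\{y_i\} \sim \prod_{i \in S} P(\cdot \mid x_i;\,\theta)}
  \Big[\Delta\big(\{y_i\}_{i \in S},\,\{y_i^{*}\}_{i \in S}\big)\Big]
\end{align}
```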
Methods
  • 4.1 Setup

    The authors evaluated the approach on two translation tasks: Chinese-English and English-French.
  • For English-French translation, to compare with the results reported in previous work on end-to-end NMT[2,3,9,22,23,24], the authors used the same subset of the WMT 2014 training corpus, which contains 12M sentence pairs with 304M English words and 348M French words.
  • The concatenation of news-test 2012 and news-test 2013 serves as the validation set and news-test 2014 as the test set
Results
  • Table 4 shows the results on English-French translation.
  • The authors list existing end-to-end NMT systems that are comparable to theirs.
  • All these systems use the same subset of the WMT 2014 parallel training corpus.
  • They differ in network architectures, vocabulary sizes, and training criteria.
  • Some results are taken from the arXiv versions of the cited papers.
  • The authors' approach does not assume specific architectures and can in principle be applied to any NMT system
Conclusion
  • The authors proposed an approach to training neural machine translation models with non-decomposable evaluation metrics.
  • The basic idea is to calculate the expectations of corpus-level metrics on a subset of the training data to allow online training algorithms (see the sketch after this list).
  • Experiments showed that the approach is capable of improving the correlation between training and testing and significantly outperforms minimum risk training with decomposable evaluation metrics.
  • As the approach is transparent to network architectures and evaluation metrics, it can potentially benefit more natural language processing tasks
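
A minimal sketch of this idea in Python, assuming a hypothetical `model` object with a `sample(src)` method and a corpus-level `metric(hyps, refs)` function such as corpus BLEU; none of these names come from the authors' code:

```python
# Minimal sketch: estimate the expectation of a corpus-level
# (non-decomposable) metric on a subset of the training data, so that
# online (mini-batch) training remains possible. Hypothetical API:
# `model.sample(src)` draws one translation; `metric(hyps, refs)`
# scores a list of hypotheses jointly (e.g., corpus-level BLEU).

def expected_corpus_metric(model, subset, metric, num_samples=20):
    """Monte Carlo estimate of E[metric] over translations of `subset`.

    `subset` is a list of (source, reference) pairs. Because the metric
    is computed jointly on all hypotheses, it cannot be split into
    per-sentence terms, unlike sentence-level minimum risk training.
    """
    refs = [ref for _, ref in subset]
    total = 0.0
    for _ in range(num_samples):
        # Draw one candidate translation per source sentence.
        hyps = [model.sample(src) for src, _ in subset]
        # Pool statistics across the whole subset in a single call.
        total += metric(hyps, refs)
    return total / num_samples
```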
Objectives
  • The authors' goal is to include corpus-level evaluation metrics in the training while retaining the benefits of online training (see the gradient sketch below)
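
To keep training online, the corpus-level risk can be differentiated with the standard score-function (REINFORCE[10]) identity used in MRT. This is again a sketch in the notation introduced above, not the paper's exact derivation:

```latex
% Gradient of the corpus-level risk via the likelihood-ratio trick:
% the metric value weights the summed log-probability gradients of the
% sampled translations, so the gradient can be estimated from the same
% Monte Carlo samples used to estimate the risk itself.
\begin{equation}
\nabla_{\theta} R_{\mathrm{c}}(\theta)
  = \mathbb{E}_{\{y_i\}}\Big[\Delta\big(\{y_i\},\,\{y_i^{*}\}\big)
    \sum_{i \in S} \nabla_{\theta} \log P(y_i \mid x_i;\,\theta)\Big]
\end{equation}
```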
Tables
  • Table1: Effect of Evaluation Metrics on Translation Quality on the Validation Set
  • Table2: Case-Insensitive BLEU Scores on the Test Sets
  • Table3: Case-Insensitive TER Scores on the Test Sets
  • Table4: Comparison with Previous Work on English-French Translation
Related work
  • Our work is closely related to minimum risk training (MRT) widely used in statistical machine translation. The minimum error rate training (MERT) algorithm[4] is a special form of MRT. Although MERT is capable of optimizing models with respect to non-decomposable evaluation metrics, it is restricted to optimizing linear models with tens of features on a small development set. [12] proposes an approach to maximizing expected BLEU for training phrase and lexicon translation models; it uses the extended Baum-Welch algorithm to efficiently update model parameters. These approaches cannot be directly applied to neural machine translation because of the non-linearity in neural networks. Neural machine translation needs efficient online learning algorithms because the training dataset is always very large. As it is difficult to directly optimize models with respect to non-decomposable evaluation metrics in online learning frameworks, one possible solution is to maintain a large buffer to compute online gradient estimates, which can be prohibitive[20]. [21] considers optimizing performance measures that are concave or pseudo-linear in the canonical confusion matrix of the predictor. A key limitation of these approaches is that they only focus on classification tasks. It is non-trivial to adapt them to optimize non-decomposable evaluation metrics for neural machine translation, as the toy example after Table 4 below illustrates.

    Table 4. Comparison with Previous Work on English-French Translation

    System                           Architecture                                 Training        BLEU
    Existing end-to-end NMT systems  RNNSearch[3]                                 MLE             28.45
                                     LSTM with 4 layers[2]                        MLE             30.59
                                     RNNSearch + PosUnk[22]                       MLE             33.08
                                     LSTM with 6 layers + PosUnk[23]              MLE             32.70
                                     RNNSearch + PosUnk[9]                        sMRT            34.23
                                     RNNSearch + monolingual data + PosUnk[24]    Dual learning   34.83
    Our end-to-end NMT system        RNNSearch + PosUnk                           cMRT            34.93

    Note: The BLEU scores are case-sensitive. "PosUnk" denotes the technique of handling rare words in [23].
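
As a concrete illustration of non-decomposability, here is a toy example of my own (unigram precision stands in for BLEU's pooled n-gram counts): the corpus-level score obtained by pooling counts differs from the average of per-sentence scores, which is why sentence-level surrogates change the training objective.

```python
# Toy illustration (not from the paper): corpus-level metrics pool
# statistics over all sentences before combining them, so they do not
# decompose into per-sentence terms.

def unigram_precision(hyp, ref):
    """Fraction of hypothesis tokens that appear in the reference."""
    return sum(1 for t in hyp if t in ref) / len(hyp)

hyps = [["the", "cat"], ["a", "dog", "runs", "fast"]]
refs = [["the", "cat"], ["the", "dog", "runs"]]

# Decomposable (sentence-level): average the per-sentence scores.
sentence_avg = sum(unigram_precision(h, r)
                   for h, r in zip(hyps, refs)) / len(hyps)

# Non-decomposable (corpus-level): pool counts first, then divide.
matched = sum(sum(1 for t in h if t in r) for h, r in zip(hyps, refs))
corpus_score = matched / sum(len(h) for h in hyps)

print(sentence_avg)  # (2/2 + 2/4) / 2 = 0.75
print(corpus_score)  # (2 + 2) / (2 + 4) = 0.666...
```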
Funding
  • This work is supported by the National Natural Science Foundation of China under Grant Nos. 61522204 and 61432013, and the National High Technology Research and Development 863 Program of China under Grant No. 2015AA015407. It is also supported by the Singapore National Research Foundation under its International Research Centre@Singapore Funding Initiative, administered by the IDM (Interactive Digital Media) Programme
Reference
  • Kalchbrenner N, Blunsom P. Recurrent continuous translation models. In Proc. the Conference on Empirical Methods in Natural Language Processing, Oct. 2013, pp.1700-1709.
  • Sutskever I, Vinyals O, Le Q V. Sequence to sequence learning with neural networks. In Proc. Advances in Neural Information Processing Systems, Dec. 2014, pp.3104-3112.
  • Bahdanau D, Cho K, Bengio Y. Neural machine translation by jointly learning to align and translate. In Proc. ICLR, May 2015.
  • Och F J. Minimum error rate training in statistical machine translation. In Proc. the 41st Annual Meeting of the Association for Computational Linguistics, July 2003, pp.160-167.
  • Chiang D. A hierarchical phrase-based model for statistical machine translation. In Proc. the 43rd Annual Meeting of the Association for Computational Linguistics, June 2005, pp.263-270.
  • Hochreiter S, Schmidhuber J. Long short-term memory. Neural Computation, 1997, 9(8): 1735-1780.
  • Chung J, Gulcehre C, Cho K, Bengio Y. Empirical evaluation of gated recurrent neural networks on sequence modeling. arXiv:1412.3555, 2014. https://arxiv.org/abs/1412.3555, May 2017.
  • Ranzato M, Chopra S, Auli M, Zaremba W. Sequence level training with recurrent neural networks. In Proc. ICLR, May 2016.
  • Shen S, Cheng Y, He Z, He W, Wu H, Sun M, Liu Y. Minimum risk training for neural machine translation. In Proc. the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Aug. 2016, pp.1683-1692.
  • Williams R J. Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine Learning, 1992, 8(3/4): 229-256.
  • Smith D A, Eisner J. Minimum risk annealing for training log-linear models. In Proc. the COLING/ACL Main Conference Poster Sessions, July 2006, pp.787-794.
  • He X, Deng L. Maximum expected BLEU training of phrase and lexicon translation models. In Proc. the 50th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), July 2012, pp.292-301.
  • Gao J, He X, Yih W, Deng L. Learning continuous phrase representations for translation modeling. In Proc. the 52nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), June 2014, pp.699-709.
  • Papineni K, Roukos S, Ward T, Zhu W J. BLEU: A method for automatic evaluation of machine translation. In Proc. the 40th Annual Meeting of the Association for Computational Linguistics, July 2002, pp.311-318.
  • Snover M, Dorr B, Schwartz R, Micciulla L, Makhoul J. A study of translation edit rate with targeted human annotation. In Proc. the 7th Conference of the Association for Machine Translation in the Americas, Aug. 2006, pp.223-231.
  • Watanabe T, Suzuki J, Tsukada H, Isozaki H. Online large-margin training for statistical machine translation. In Proc. the Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning (EMNLP-CoNLL), June 2007, pp.764-773.
  • Chiang D, Marton Y, Resnik P. Online large-margin training of syntactic and structural translation features. In Proc. the Conference on Empirical Methods in Natural Language Processing, Oct. 2008, pp.224-233.
  • Chiang D. Hope and fear for discriminative training of statistical translation models. The Journal of Machine Learning Research, 2012, 13(1): 1159-1187.
  • Neubig G, Watanabe T. Optimization for statistical machine translation: A survey. Computational Linguistics, 2016, 42(2): 1-54.
  • Kar P, Narasimhan H, Jain P. Online and stochastic gradient methods for non-decomposable loss functions. In Proc. the 27th Advances in Neural Information Processing Systems, Dec. 2014, pp.694-702.
  • Narasimhan H, Vaish R, Agarwal S. On the statistical consistency of plug-in classifiers for non-decomposable performance measures. In Proc. the 27th Advances in Neural Information Processing Systems, Dec. 2014, pp.1493-1501.
  • Jean S, Cho K, Memisevic R, Bengio Y. On using very large target vocabulary for neural machine translation. In Proc. the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), July 2015, pp.1-10.
  • Luong M T, Sutskever I, Le Q V, Vinyals O, Zaremba W. Addressing the rare word problem in neural machine translation. In Proc. the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), July 2015, pp.11-19.
  • He D, Xia Y, Qin T, Wang L, Yu N, Liu T, Ma W Y. Dual learning for machine translation. In Proc. the 30th Advances in Neural Information Processing Systems, Dec. 2016, pp.820-828.
  • Koehn P. Statistical significance tests for machine translation evaluation. In Proc. the Conference on Empirical Methods in Natural Language Processing, July 2004, pp.388-395.

Yang Liu is an associate professor in the Department of Computer Science and Technology at Tsinghua University, Beijing. He received his Ph.D. degree from the Institute of Computing Technology, Chinese Academy of Sciences, Beijing, in 2007. His research areas include natural language processing and machine translation.

Mao-Song Sun is a professor in the Department of Computer Science and Technology at Tsinghua University, Beijing. He received his Ph.D. degree in computational linguistics from City University of Hong Kong, Hong Kong, in 2004. His research interests include natural language processing, Web intelligence, and machine learning.