Improved Neural Machine Translation with SMT Features

AAAI, pp. 151–157, 2016.

Keywords:
recurrent neural network, encoder-decoder, language model, neural machine translation, stochastic gradient descent

Abstract:

Neural machine translation (NMT) performs end-to-end translation with a source-language encoder and a target-language decoder, achieving promising translation performance. However, as a newly emerged approach, the method still has some limitations. An NMT system usually has to restrict its vocabulary to a certain size to avoid time-consuming training...

Introduction
  • Neural networks have recently been applied to machine translation and have begun to show promising results. Sutskever, Vinyals, and Le (2014) and Bahdanau, Cho, and Bengio (2014) directly built neural networks to perform end-to-end translation, an approach named neural machine translation (NMT).
  • An NMT system contains two components: an encoder that converts a source sentence into a vector, and a decoder that generates the target translation based on that vector.
  • This section briefly reviews the RNN encoder-decoder, a recently proposed NMT approach based on recurrent neural networks, and the log-linear model, the dominant framework for SMT over the last decade.
  • Given a source sentence f = f_1, f_2, ..., f_I, the encoder first encodes f into a sequence of vectors, and the decoder then generates the target translation e = e_1, e_2, ..., e_J based on those vectors and the previously generated target words (a minimal sketch of this computation follows this list).
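
The encoder-decoder computation above can be made concrete with a toy forward pass. The sketch below is a minimal illustration, assuming a plain tanh RNN with the last encoder state used as a fixed context vector; real systems use trained GRU/LSTM units and attention, and all names, sizes, and values here are hypothetical.

```python
import numpy as np

# Minimal illustrative encoder-decoder forward pass using a plain tanh
# RNN. All names, sizes, and parameter values are hypothetical; real
# systems use GRU/LSTM units, trained parameters, and attention.

rng = np.random.default_rng(0)
V_SRC, V_TGT, D = 50, 50, 16              # toy vocabulary sizes, hidden size

E_src = rng.normal(0, 0.1, (V_SRC, D))    # source word embeddings
E_tgt = rng.normal(0, 0.1, (V_TGT, D))    # target word embeddings
W_enc = rng.normal(0, 0.1, (D, D))        # encoder input weights
U_enc = rng.normal(0, 0.1, (D, D))        # encoder recurrent weights
W_dec = rng.normal(0, 0.1, (D, D))        # decoder input weights
U_dec = rng.normal(0, 0.1, (D, D))        # decoder recurrent weights
W_out = rng.normal(0, 0.1, (D, V_TGT))    # output projection

def encode(src_ids):
    """Encode f = f_1..f_I into a sequence of hidden vectors."""
    h, states = np.zeros(D), []
    for i in src_ids:
        h = np.tanh(E_src[i] @ W_enc + h @ U_enc)
        states.append(h)
    return states

def decode_step(prev_word, h, context):
    """One decoder step: next-word distribution given history and context."""
    h = np.tanh(E_tgt[prev_word] @ W_dec + h @ U_dec + context)
    logits = h @ W_out
    p = np.exp(logits - logits.max())
    return h, p / p.sum()

# Greedy decoding of a toy source sentence f = [3, 7, 12].
states = encode([3, 7, 12])
context = states[-1]           # last encoder state as a fixed context vector
h, word = np.zeros(D), 0       # assume index 0 is <bos>
for _ in range(5):
    h, p = decode_step(word, h, context)
    word = int(p.argmax())
    print(word, float(p[word]))
```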
Highlights
  • Neural networks have recently been applied to machine translation and have begun to show promising results. Sutskever, Vinyals, and Le (2014) and Bahdanau, Cho, and Bengio (2014) directly built neural networks to perform end-to-end translation, an approach named neural machine translation (NMT)
  • We propose to improve NMT by integrating statistical machine translation (SMT) features with the NMT model under the log-linear framework (a sketch of this scoring scheme follows this list)
  • We observed that the proposed method significantly improves the translation quality of the conventional NMT system
  • The translation table is trained on a word-aligned bilingual corpus via the standard phrase-based SMT method, and the language model is trained on monolingual target sentences
  • Experiments on Chinese-to-English translation tasks show that our system achieves significant improvements over the baseline when trained on a large corpus crawled from the web
  • We plan to improve NMT with phrase pairs, which are good at capturing local word reordering, idiom translation, etc.
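
How the log-linear framework combines the NMT score with SMT features can be shown with a small sketch. The feature names, values, and weights below are invented for illustration only; in the paper the weights are tuned with MERT rather than set by hand.

```python
# Hedged sketch of log-linear scoring: the hypothesis score is a
# weighted sum of feature values, score(e|f) = sum_i lambda_i * h_i(f, e).
# Feature names, values, and weights below are invented; in the paper
# the weights are tuned with MERT rather than set by hand.

def loglinear_score(features, weights):
    """Weighted sum of feature values for one candidate translation."""
    return sum(weights[name] * value for name, value in features.items())

# Hypothetical feature values for one candidate translation e of f:
features = {
    "nmt_log_prob": -12.3,  # log P(e|f) from the encoder-decoder
    "tm_log_prob":  -8.7,   # word translation table score
    "lm_log_prob":  -15.1,  # n-gram language model log-probability
    "word_reward":   9.0,   # output length |e|, counters short outputs
}
weights = {"nmt_log_prob": 1.0, "tm_log_prob": 0.4,
           "lm_log_prob": 0.3, "word_reward": 0.2}

print(loglinear_score(features, weights))  # higher is better
```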
Methods
  • The authors carried out experiments on Chinese-to-English translation.
  • The training corpora are automatically crawled from the web, containing about 2.2 billion Chinese words and 2.3 billion English words.
  • The authors used NIST MT06 as the development set and tested the system on NIST MT08.
  • The evaluation metric is case-insensitive BLEU-4 (Papineni et al., 2002).
  • The feature weights of the translation system are tuned with standard minimum error rate training (MERT) (Och 2003) to maximize the system's BLEU score on the development set (a toy illustration of this tuning objective follows this list).
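
The tuning objective can be illustrated with a toy stand-in for MERT: choose the feature weights under which the 1-best hypotheses score highest on a development-set metric. Real MERT performs an exact line search per feature (Och 2003); the brute-force grid below, with invented numbers, only illustrates the objective.

```python
# Toy stand-in for MERT: pick the feature weights under which the
# 1-best hypotheses score highest on a development-set metric. Real
# MERT performs an exact line search per feature (Och 2003); this
# brute-force grid, with invented numbers, only illustrates the objective.

# Each dev sentence: an n-best list of (feature values, quality proxy).
dev_nbest = [
    [({"nmt": -10.0, "wr": 8}, 0.31), ({"nmt": -11.5, "wr": 11}, 0.42)],
    [({"nmt": -7.2,  "wr": 6}, 0.25), ({"nmt": -8.0,  "wr": 9},  0.38)],
]

def dev_score(weights):
    """Total quality of the 1-best hypotheses selected by these weights."""
    total = 0.0
    for nbest in dev_nbest:
        feats, quality = max(
            nbest, key=lambda h: sum(weights[k] * v for k, v in h[0].items()))
        total += quality
    return total

# Fix the NMT weight at 1.0 and grid-search the word-reward weight.
best_w = max(({"nmt": 1.0, "wr": w} for w in [0.0, 0.2, 0.4, 0.6]),
             key=dev_score)
print(best_w, dev_score(best_w))
```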
Results
  • The authors observed that the proposed method significantly improves the translation quality of the conventional NMT system.
  • By adding the word translation table and the word reward features, the method obtained significant improvements over the baseline.
  • The average output lengths on the test set are 23.5 words for our system and 21.4 for GroundHog.
  • This indicates that the method alleviates the inadequate translation problem.
  • Further analyses and discussions are presented below.
Conclusion
  • In order to further study the performance of the proposed method, the authors compared the outputs of the systems.

    Improving Lexical Translation

    Taking the first sentence in Table 2 as an example, the GroundHog system omits the translation of “传输(chuanshu) transmission”.
  • This can be attributed to the fact that the translation table consists of word pairs with translation probabilities estimated from the word-aligned training corpus, providing another way to measure the relationship between source and target words (a sketch of one such use of the table, OOV recovery, follows this list).
  • In this example, the translation table contains the word pair “chuanshu, transmission” with a high probability.

    Conclusion and Future Work

  • The translation table is trained on a word-aligned bilingual corpus via the standard phrase-based SMT method, and the language model is trained on monolingual target sentences.
  • The authors plan to improve NMT with phrase pairs, which are good at capturing local word reordering, idiom translation, etc.
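
One plausible way a word translation table recovers OOV outputs (cf. Table 4) is to replace each unknown-word token the decoder emits with the highest-probability table translation of the source word it is aligned to, in the spirit of Luong et al. (2015). The sketch below assumes this replacement scheme; the table entries and alignments are invented examples.

```python
# Hedged sketch of OOV recovery with a word translation table (cf.
# Table 4), in the spirit of Luong et al. (2015): each <unk> token the
# decoder emits is replaced by the highest-probability translation of
# the source word it is aligned to. Table entries and alignments here
# are invented examples.

translation_table = {   # p(target | source) from word-aligned data
    "chuanshu": [("transmission", 0.71), ("transfer", 0.18)],
}

def replace_unk(output_words, aligned_src_words):
    """Swap each <unk> for the best table translation of its source word."""
    fixed = []
    for tgt, src in zip(output_words, aligned_src_words):
        if tgt == "<unk>" and src in translation_table:
            tgt = max(translation_table[src], key=lambda e: e[1])[0]
        fixed.append(tgt)
    return fixed

print(replace_unk(["data", "<unk>", "speed"],
                  ["shuju", "chuanshu", "sudu"]))
# -> ['data', 'transmission', 'speed']
```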
Tables
  • Table 1: BLEU scores on the development and test sets. TM = translation model, WR = word reward, LM = language model, PBSMT = phrase-based SMT
  • Table 2: Translation examples. Chinese words in bold are correctly translated by our system
  • Table 3: Statistics of the percentages of OOV words for PBSMT, GroundHog, and our method
  • Table 4: Effect of translating OOV words. Our Method = GroundHog+TM+WR+LM; -OOV means the translation table is not used to recover OOV words
Funding
  • This research is supported by the National Basic Research Program of China (973 Program No. 2014CB340505)
References
  • Auli, M.; Galley, M.; Quirk, C.; and Zweig, G. 2013. Joint language and translation modeling with recurrent neural networks. In Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing, 1044–1054.
  • Bahdanau, D.; Cho, K.; and Bengio, Y. 2014. Neural machine translation by jointly learning to align and translate. arXiv:1409.0473 [cs.CL].
  • Brown, P. F.; Pietra, S. A. D.; Pietra, V. J. D.; and Mercer, R. L. 1993. The mathematics of statistical machine translation: Parameter estimation. Computational Linguistics 19(2):263–311.
  • Cho, K.; van Merrienboer, B.; Gulcehre, C.; Bahdanau, D.; Bougares, F.; Schwenk, H.; and Bengio, Y. 2014. Learning phrase representations using RNN encoder-decoder for statistical machine translation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), 1724–1734.
  • Devlin, J.; Zbib, R.; Huang, Z.; Lamar, T.; Schwartz, R.; and Makhoul, J. 2014. Fast and robust neural network joint models for statistical machine translation. In Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics, 1370–1380.
  • Gulcehre, C.; Firat, O.; Xu, K.; Cho, K.; Barrault, L.; Lin, H.-C.; Bougares, F.; Schwenk, H.; and Bengio, Y. 2015. On using monolingual corpora in neural machine translation. arXiv:1503.03535 [cs.CL].
  • Hu, X.; Li, W.; Lan, X.; Wu, H.; and Wang, H. 2015. Optimized beam search with constrained softmax for NMT. In MT Summit XV.
  • Koehn, P.; Hoang, H.; Birch, A.; Callison-Burch, C.; Federico, M.; Bertoldi, N.; Cowan, B.; Shen, W.; Moran, C.; Zens, R.; Dyer, C.; Bojar, O.; Constantin, A.; and Herbst, E. 2007. Moses: Open source toolkit for statistical machine translation. In Proceedings of the ACL 2007 Demonstration Session.
  • Koehn, P.; Och, F. J.; and Marcu, D. 2003. Statistical phrase-based translation. In Proceedings of HLT-NAACL 2003, 127–133.
  • Li, P.; Liu, Y.; and Sun, M. 2013. Recursive autoencoders for ITG-based translation. In Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing, 567–577.
  • Luong, M.-T.; Sutskever, I.; Le, Q. V.; Vinyals, O.; and Zaremba, W. 2015. Addressing the rare word problem in neural machine translation. In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), 11–19.
  • Och, F. J., and Ney, H. 2002. Discriminative training and maximum entropy models for statistical machine translation. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, 295–302.
  • Och, F. J., and Ney, H. 2004. The alignment template approach to statistical machine translation. Computational Linguistics 30(4):417–449.
  • Och, F. J. 2003. Minimum error rate training in statistical machine translation. In Proceedings of the 41st Annual Meeting of the Association for Computational Linguistics, 160–167.
  • Riezler, S., and Maxwell, J. T. 2005. On some pitfalls in automatic evaluation and significance testing for MT. In Proceedings of the ACL Workshop on Intrinsic and Extrinsic Evaluation Measures for Machine Translation and/or Summarization, 57–64.
  • Schuster, M., and Paliwal, K. K. 1997. Bidirectional recurrent neural networks. IEEE Transactions on Signal Processing 45(11):2673–2681.
  • Stolcke, A. 2002. SRILM – an extensible language modeling toolkit. In Proceedings of the International Conference on Spoken Language Processing, volume 2, 901–904.
  • Sutskever, I.; Vinyals, O.; and Le, Q. V. 2014. Sequence to sequence learning with neural networks. In Advances in Neural Information Processing Systems (NIPS 2014).
  • Zeiler, M. D. 2012. ADADELTA: An adaptive learning rate method. arXiv:1212.5701 [cs.LG].
  • Zhai, F.; Zhang, J.; Zhou, Y.; and Zong, C. 2013. RNN-based derivation structure prediction for SMT. In Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics, 779–784.