Learning Phrase Representations using RNN Encoder-Decoder for Statistical Machine Translation.

EMNLP (2014): 1724–1734


Abstract

In this paper, we propose a novel neural network model called RNN Encoder–Decoder that consists of two recurrent neural networks (RNN). One RNN encodes a sequence of symbols into a fixed-length vector representation, and the other decodes the representation into another sequence of symbols. The encoder and decoder of the proposed model are jointly trained to maximize the conditional probability of a target sequence given a source sequence.
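To make the two-RNN structure concrete, the following is a minimal sketch of the idea, not the authors' implementation: the vocabulary, layer sizes, and parameter names are arbitrary assumptions, and plain tanh units stand in for the paper's gated units. One RNN reads the source symbols and keeps only its final hidden state as the fixed-length summary c; a second RNN then emits target symbols conditioned on c and the previously generated symbol.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy sizes chosen for illustration only; they do not reflect the paper's model.
VOCAB, EMB, HID = 20, 8, 16

def init(shape):
    return rng.normal(0.0, 0.1, shape)

E_src, E_tgt = init((VOCAB, EMB)), init((VOCAB, EMB))   # word embeddings
W_enc, U_enc = init((HID, EMB)), init((HID, HID))       # encoder weights
W_dec, U_dec, C_dec = init((HID, EMB)), init((HID, HID)), init((HID, HID))
W_out = init((VOCAB, HID))                              # output projection

def encode(src_ids):
    """Compress the whole source sequence into one fixed-length vector."""
    h = np.zeros(HID)
    for i in src_ids:
        h = np.tanh(W_enc @ E_src[i] + U_enc @ h)
    return h

def decode_step(y_prev, h, c):
    """One decoder step conditioned on the previous symbol and the summary c."""
    h = np.tanh(W_dec @ E_tgt[y_prev] + U_dec @ h + C_dec @ c)
    logits = W_out @ h
    p = np.exp(logits - logits.max())
    return h, p / p.sum()

# Greedy decoding of a few symbols from a toy source sequence.
c = encode([3, 7, 1])
h, y = np.zeros(HID), 0          # symbol 0 plays the role of a start token
for _ in range(5):
    h, p = decode_step(y, h, c)
    y = int(p.argmax())
    print(y, end=" ")
print()
```

In the actual model the decoder also conditions its output distribution on c and the previous symbol at every step, which decode_step mirrors here in simplified form.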

Introduction
  • Deep neural networks have shown great success in various applications such as object recognition (see, e.g., Krizhevsky et al., 2012) and speech recognition (see, e.g., Dahl et al., 2012).
  • Many recent works showed that neural networks can be successfully used in a number of tasks in natural language processing (NLP).
  • These include, but are not limited to, language modeling (Bengio et al., 2003), paraphrase detection (Socher et al., 2011) and word embedding extraction (Mikolov et al., 2013).
  • (Schwenk, 2012) summarizes a successful use of feedforward neural networks in the framework of a phrase-based SMT system.
  • The authors propose to use a rather sophisticated hidden unit in order to improve both the memory capacity and the ease of training, as sketched below.
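The "sophisticated hidden unit" is the gated unit described in the paper: a reset gate that lets the unit drop the previous state and an update gate that interpolates between the previous state and a candidate state. A minimal sketch of one step follows; the parameter names, sizes, and random initialization are illustrative assumptions, not the authors' code.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gated_step(x, h_prev, p):
    """One step of the gated unit: reset gate r, update gate z,
    candidate state h_tilde, and the interpolated new state."""
    r = sigmoid(p["Wr"] @ x + p["Ur"] @ h_prev)            # reset gate
    z = sigmoid(p["Wz"] @ x + p["Uz"] @ h_prev)            # update gate
    h_tilde = np.tanh(p["W"] @ x + p["U"] @ (r * h_prev))  # candidate state
    return z * h_prev + (1.0 - z) * h_tilde                # new hidden state

# Toy usage with random parameters: W* act on the input, U* on the hidden state.
rng = np.random.default_rng(0)
D, H = 8, 16
p = {k: rng.normal(0, 0.1, (H, D if k.startswith("W") else H))
     for k in ["Wr", "Ur", "Wz", "Uz", "W", "U"]}
h = gated_step(rng.normal(size=D), np.zeros(H), p)
print(h.shape)
```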
Highlights
  • Deep neural networks have shown great success in various applications such as object recognition (see, e.g., Krizhevsky et al., 2012) and speech recognition (see, e.g., Dahl et al., 2012)
  • In the field of statistical machine translation (SMT), deep neural networks have begun to show promising results. (Schwenk, 2012) summarizes a successful use of feedforward neural networks in the framework of a phrase-based SMT system. Along this line of research on using neural networks for SMT, this paper focuses on a novel neural network architecture that can be used as a part of the conventional phrase-based SMT system
  • We proposed a new neural network architecture, called the RNN Encoder–Decoder, that is able to learn the mapping from a sequence of an arbitrary length to another sequence, possibly from a different set, of an arbitrary length
  • We evaluated the proposed model on the task of statistical machine translation, where we used the RNN Encoder–Decoder to score each phrase pair in the phrase table
  • We were able to show that the new model captures linguistic regularities in the phrase pairs well and that the RNN Encoder–Decoder is able to propose well-formed target phrases
  • We found that the contribution of the RNN Encoder–Decoder is rather orthogonal to the existing approach of using neural networks in the SMT system, so that the performance can be improved further by using, for instance, the RNN Encoder–Decoder and a neural net language model together
Methods
  • The authors evaluate the approach on the English/French translation task of the WMT’14 workshop.

    4.1 Data and Baseline System

    Large amounts of resources are available to build an English/French SMT system in the framework of the WMT’14 translation task.
  • The authors restricted training to task-relevant subsets of these data by applying the data selection method proposed in (Moore and Lewis, 2010) and its extension to bitexts (Axelrod et al., 2011); a sketch of this selection criterion is given after this list
  • By these means the authors selected a subset of 418M words out of more than 2 billion words for language modeling and a subset of 348M out of 850M words for training the RNN Encoder–Decoder
  • The development and test sets each contain more than 70 thousand words and a single reference translation
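The Moore–Lewis criterion referenced above ranks each candidate sentence by the difference between its per-word cross-entropy under an in-domain language model and under a general-domain one, keeping the lowest-scoring (most in-domain-like) sentences. A minimal sketch, where in_domain_xent, general_xent, and keep_fraction are assumed placeholders rather than anything from the paper's pipeline:

```python
# in_domain_xent and general_xent are assumed to return the per-word
# cross-entropy of a sentence under an in-domain and a general-domain
# language model; they stand in for whatever LM toolkit is actually used.
def select_subset(sentences, in_domain_xent, general_xent, keep_fraction=0.2):
    """Rank sentences by cross-entropy difference and keep the best ones."""
    scored = sorted(
        sentences,
        key=lambda s: in_domain_xent(s) - general_xent(s),  # lower = more in-domain
    )
    return scored[: int(len(scored) * keep_fraction)]

# Toy usage with made-up cross-entropy functions (longer "sentences" are
# treated as less in-domain here purely for demonstration).
corpus = ["a b", "a b c d e f", "a b c"]
print(select_subset(corpus, lambda s: len(s.split()), lambda s: 3.0, keep_fraction=0.34))
```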
Conclusion
  • The authors proposed a new neural network architecture, called the RNN Encoder–Decoder, that is able to learn the mapping from a sequence of an arbitrary length to another sequence, possibly from a different set, of an arbitrary length.
  • The proposed RNN Encoder–Decoder is able to either score a pair of sequences or generate a target sequence given a source sequence.
  • The authors evaluated the proposed model on the task of statistical machine translation, where they used the RNN Encoder–Decoder to score each phrase pair in the phrase table.
  • The authors were able to show that the new model captures linguistic regularities in the phrase pairs well and that the RNN Encoder–Decoder is able to propose well-formed target phrases.
  • The authors found that the contribution of the RNN Encoder–Decoder is rather orthogonal to the existing approach of using neural networks in the SMT system, so that the performance can be improved further by using, for instance, the RNN Encoder–Decoder and a neural net language model together.
Tables
  • Table 1: BLEU scores computed on the development and test sets using different combinations of approaches. WP denotes a word penalty, which penalizes the number of words that are unknown to the neural networks
  • Table 2: The top scoring target phrases for a small set of source phrases according to the translation model (direct translation probability) and by the RNN Encoder–Decoder. Source phrases were randomly selected from phrases with 4 or more words. ∅ denotes an incomplete (partial) character, and г is the Cyrillic letter ghe
  • Table 3: Samples generated from the RNN Encoder–Decoder for each source phrase used in Table 2. We show the top 5 target phrases out of 50 samples. They are sorted by the RNN Encoder–Decoder scores
Funding
  • FB and HS were partially funded by the European Commission under the project MateCat, and by DARPA under the BOLT project
Study subjects and analysis
randomly selected phrase pairs: 64
We used Adadelta and stochastic gradient descent to train the RNN Encoder–Decoder with hyperparameters ε = 10⁻⁶ and ρ = 0.95 (Zeiler, 2012). At each update, we used 64 randomly selected phrase pairs from a phrase table (which was created from 348M words). The model was trained for approximately three days.
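For reference, one Adadelta update with the quoted hyperparameters (ρ = 0.95, ε = 10⁻⁶) looks roughly as follows; this is a generic sketch of the rule from (Zeiler, 2012), not the authors' training code.

```python
import numpy as np

def adadelta_update(param, grad, state, rho=0.95, eps=1e-6):
    """One Adadelta step: accumulate squared gradients, scale the step by the
    ratio of accumulated update and gradient magnitudes, then accumulate the
    squared update."""
    state["Eg2"] = rho * state["Eg2"] + (1 - rho) * grad ** 2
    delta = -np.sqrt(state["Edx2"] + eps) / np.sqrt(state["Eg2"] + eps) * grad
    state["Edx2"] = rho * state["Edx2"] + (1 - rho) * delta ** 2
    return param + delta

# Toy usage: one update of a parameter vector with a random "gradient".
rng = np.random.default_rng(0)
w = rng.normal(size=4)
state = {"Eg2": np.zeros_like(w), "Edx2": np.zeros_like(w)}
w = adadelta_update(w, rng.normal(size=4), state)
print(w)
```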

samples: 50
Furthermore, in Table 3 we show, for each of the source phrases in Table 2, the samples generated from the RNN Encoder–Decoder. For each source phrase, we generated 50 samples and show the top five phrases according to their scores. We can see that the RNN Encoder–Decoder is able to propose well-formed target phrases without looking at the actual phrase table.
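The sample-then-rank procedure described above amounts to a few lines; sample_target and score_pair below are hypothetical stand-ins for the trained model's sampling and scoring routines, and the example inputs are made up.

```python
import random

def top_candidates(source, sample_target, score_pair, n_samples=50, top_k=5):
    """Draw n_samples target phrases for a source phrase and keep the top_k by score."""
    candidates = [sample_target(source) for _ in range(n_samples)]
    candidates.sort(key=lambda t: score_pair(source, t), reverse=True)
    return candidates[:top_k]

# Toy usage with random stand-ins for the model.
print(top_candidates(
    "la maison",
    lambda s: " ".join(random.choices(["the", "house", "home"], k=2)),
    lambda s, t: random.random(),
    n_samples=10, top_k=3))
```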

Reference
  • [Auli et al.2013] Michael Auli, Michel Galley, Chris Quirk, and Geoffrey Zweig. 2013. Joint language and translation modeling with recurrent neural networks. In Proceedings of the ACL Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 1044–1054.
  • [Axelrod et al.2011] Amittai Axelrod, Xiaodong He, and Jianfeng Gao. 2011. Domain adaptation via pseudo in-domain data selection. In Proceedings of the ACL Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 355–362.
  • [Bastien et al.2012] Frederic Bastien, Pascal Lamblin, Razvan Pascanu, James Bergstra, Ian J. Goodfellow, Arnaud Bergeron, Nicolas Bouchard, and Yoshua Bengio. 2012. Theano: new features and speed improvements. Deep Learning and Unsupervised Feature Learning NIPS 2012 Workshop.
  • [Bengio et al.2003] Yoshua Bengio, Rejean Ducharme, Pascal Vincent, and Christian Janvin. 2003. A neural probabilistic language model. J. Mach. Learn. Res., 3:1137–1155, March.
  • [Bengio et al.2013] Y. Bengio, N. Boulanger-Lewandowski, and R. Pascanu. 2013. Advances in optimizing recurrent networks. In Proceedings of the 38th International Conference on Acoustics, Speech, and Signal Processing (ICASSP 2013), May.
  • [Bergstra et al.2010] James Bergstra, Olivier Breuleux, Frederic Bastien, Pascal Lamblin, Razvan Pascanu, Guillaume Desjardins, Joseph Turian, David Warde-Farley, and Yoshua Bengio. 2010. Theano: a CPU and GPU math expression compiler. In Proceedings of the Python for Scientific Computing Conference (SciPy), June. Oral presentation.
  • [Chandar et al.2014] Sarath Chandar, Stanislas Lauly, Hugo Larochelle, Mitesh Khapra, Balaraman Ravindran, Vikas Raykar, and Amrita Saha. 2014. An autoencoder approach to learning bilingual word representations. arXiv:1402.1454 [cs.CL], February.
  • [Dahl et al.2012] George E. Dahl, Dong Yu, Li Deng, and Alex Acero. 2012. Context-dependent pre-trained deep neural networks for large vocabulary speech recognition. IEEE Transactions on Audio, Speech, and Language Processing, 20(1):33–42.
  • [Devlin et al.2014] Jacob Devlin, Rabih Zbib, Zhongqiang Huang, Thomas Lamar, Richard Schwartz, and John Makhoul. 2014. Fast and robust neural network joint models for statistical machine translation. In Proceedings of the ACL 2014 Conference, ACL '14, pages 1370–1380.
  • [Gao et al.2013] Jianfeng Gao, Xiaodong He, Wen-tau Yih, and Li Deng. 2013. Learning semantic representations for the phrase translation model. Technical report, Microsoft Research.
  • [Glorot et al.2011] X. Glorot, A. Bordes, and Y. Bengio. 2011. Deep sparse rectifier neural networks. In AISTATS'2011.
  • [Goodfellow et al.2013] Ian J. Goodfellow, David Warde-Farley, Mehdi Mirza, Aaron Courville, and Yoshua Bengio. 2013. Maxout networks. In ICML'2013.
  • [Graves2012] Alex Graves. 2012. Supervised Sequence Labelling with Recurrent Neural Networks. Studies in Computational Intelligence. Springer.
  • [Hochreiter and Schmidhuber1997] S. Hochreiter and J. Schmidhuber. 1997. Long short-term memory. Neural Computation, 9(8):1735–1780.
  • [Kalchbrenner and Blunsom2013] Nal Kalchbrenner and Phil Blunsom. 2013. Two recurrent continuous translation models. In Proceedings of the ACL Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 1700–1709.
  • [Koehn et al.2003] Philipp Koehn, Franz Josef Och, and Daniel Marcu. 2003. Statistical phrase-based translation. In Proceedings of the 2003 Conference of the North American Chapter of the Association for Computational Linguistics on Human Language Technology - Volume 1, NAACL '03, pages 48–54.
  • [Koehn2005] P. Koehn. 2005. Europarl: A parallel corpus for statistical machine translation. In Machine Translation Summit X, pages 79–86, Phuket, Thailand.
  • [Krizhevsky et al.2012] Alex Krizhevsky, Ilya Sutskever, and Geoffrey Hinton. 2012. ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems 25 (NIPS'2012).
  • [Marcu and Wong2002] Daniel Marcu and William Wong. 2002. A phrase-based, joint probability model for statistical machine translation. In Proceedings of the ACL-02 Conference on Empirical Methods in Natural Language Processing - Volume 10, EMNLP '02, pages 133–139.
  • [Mikolov et al.2013] Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg Corrado, and Jeff Dean. 2013. Distributed representations of words and phrases and their compositionality. In Advances in Neural Information Processing Systems 26, pages 3111–3119.
  • [Moore and Lewis2010] Robert C. Moore and William Lewis. 2010. Intelligent selection of language model training data. In Proceedings of the ACL 2010 Conference Short Papers, ACLShort '10, pages 220–224, Stroudsburg, PA, USA.
  • [Pascanu et al.2014] R. Pascanu, C. Gulcehre, K. Cho, and Y. Bengio. 2014. How to construct deep recurrent neural networks. In Proceedings of the Second International Conference on Learning Representations (ICLR 2014), April.
  • [Saxe et al.2014] Andrew M. Saxe, James L. McClelland, and Surya Ganguli. 2014. Exact solutions to the nonlinear dynamics of learning in deep linear neural networks. In Proceedings of the Second International Conference on Learning Representations (ICLR 2014), April.
  • [Schwenk et al.2006] Holger Schwenk, Marta R. Costa-Jussa, and Jose A. R. Fonollosa. 2006. Continuous space language models for the IWSLT 2006 task. In IWSLT, pages 166–173.
  • [Schwenk2007] Holger Schwenk. 2007. Continuous space language models. Comput. Speech Lang., 21(3):492–518, July.
  • [Schwenk2012] Holger Schwenk. 2012. Continuous space translation models for phrase-based statistical machine translation. In Martin Kay and Christian Boitet, editors, Proceedings of the 24th International Conference on Computational Linguistics (COLING), pages 1071–1080.
  • [Socher et al.2011] Richard Socher, Eric H. Huang, Jeffrey Pennington, Andrew Y. Ng, and Christopher D. Manning. 2011. Dynamic pooling and unfolding recursive autoencoders for paraphrase detection. In Advances in Neural Information Processing Systems 24.
  • [Son et al.2012] Le Hai Son, Alexandre Allauzen, and Francois Yvon. 2012. Continuous space translation models with neural networks. In Proceedings of the 2012 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, NAACL HLT '12, pages 39–48, Stroudsburg, PA, USA.
  • [van der Maaten2013] Laurens van der Maaten. 2013. Barnes-Hut-SNE. In Proceedings of the First International Conference on Learning Representations (ICLR 2013), May.
  • [Vaswani et al.2013] Ashish Vaswani, Yinggong Zhao, Victoria Fossum, and David Chiang. 2013. Decoding with large-scale neural language models improves translation. In Proceedings of the Conference on Empirical Methods in Natural Language Processing, pages 1387–1392.
  • [Zeiler2012] Matthew D. Zeiler. 2012. ADADELTA: an adaptive learning rate method. Technical report, arXiv 1212.5701.
  • [Zou et al.2013] Will Y. Zou, Richard Socher, Daniel M. Cer, and Christopher D. Manning. 2013. Bilingual word embeddings for phrase-based machine translation. In Proceedings of the ACL Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 1393–1398.