Learning Phrase Representations using RNN Encoder-Decoder for Statistical Machine Translation.
EMNLP, (2014): 1724-1734
In this paper, we propose a novel neural network model called RNN Encoder-Decoder that consists of two recurrent neural networks (RNN). One RNN encodes a sequence of symbols into a fixed-length vector representation, and the other decodes the representation into another sequence of symbols. The encoder and decoder of the proposed model are jointly trained to maximize the conditional probability of a target sequence given a source sequence.
- Deep neural networks have shown great success in various applications such as object recognition (see, e.g., (Krizhevsky et al, 2012)) and speech recognition (see, e.g., (Dahl et al, 2012)).
- Many recent works showed that neural networks can be successfully used in a number of tasks in natural language processing (NLP).
- These include, but are not limited to, language modeling (Bengio et al, 2003), paraphrase detection (Socher et al, 2011) and word embedding extraction (Mikolov et al, 2013).
- (Schwenk, 2012) summarizes a successful usage of feedforward neural networks in the framework of phrase-based SMT system.
- The authors propose to use a rather sophisticated hidden unit in order to improve both the memory capacity and the ease of training.
- In the field of statistical machine translation (SMT), deep neural networks have begun to show promising results. (Schwenk, 2012) summarizes a successful usage of feedforward neural networks in the framework of a phrase-based SMT system. Along this line of research, this paper focuses on a novel neural network architecture that can be used as part of a conventional phrase-based SMT system.
- We proposed a new neural network architecture, called the RNN Encoder–Decoder, that is able to learn the mapping from a sequence of arbitrary length to another sequence, possibly from a different set, of arbitrary length.
- We evaluated the proposed model on the task of statistical machine translation, where we used the RNN Encoder–Decoder to score each phrase pair in the phrase table.
- We showed that the new model captures linguistic regularities in the phrase pairs well and that the RNN Encoder–Decoder is able to propose well-formed target phrases.
- We found that the contribution of the RNN Encoder–Decoder is rather orthogonal to the existing approach of using neural networks in the SMT system, so that performance can be improved further by using, for instance, the RNN Encoder–Decoder and the neural net language model together.
- The authors evaluate the approach on the English/French translation task of the WMT’14 workshop.
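The "rather sophisticated hidden unit" mentioned above uses a reset gate and an update gate (this unit later became known as the GRU). A minimal numpy sketch of how such a gated encoder compresses a variable-length sequence into a fixed-length vector; the weights are random here, purely illustrative:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

class GRUEncoder:
    """Encodes a sequence of input vectors into one fixed-length vector,
    using the gated hidden unit from the paper (reset gate r, update gate z)."""

    def __init__(self, input_dim, hidden_dim, seed=0):
        rng = np.random.default_rng(seed)
        s = 0.1
        # one weight pair (W: input->hidden, U: hidden->hidden) per gate
        self.Wz, self.Uz = s * rng.standard_normal((hidden_dim, input_dim)), s * rng.standard_normal((hidden_dim, hidden_dim))
        self.Wr, self.Ur = s * rng.standard_normal((hidden_dim, input_dim)), s * rng.standard_normal((hidden_dim, hidden_dim))
        self.Wh, self.Uh = s * rng.standard_normal((hidden_dim, input_dim)), s * rng.standard_normal((hidden_dim, hidden_dim))
        self.hidden_dim = hidden_dim

    def step(self, x, h):
        z = sigmoid(self.Wz @ x + self.Uz @ h)               # update gate
        r = sigmoid(self.Wr @ x + self.Ur @ h)               # reset gate
        h_tilde = np.tanh(self.Wh @ x + self.Uh @ (r * h))   # candidate state
        return z * h + (1.0 - z) * h_tilde                   # interpolate old/new

    def encode(self, xs):
        h = np.zeros(self.hidden_dim)
        for x in xs:          # read symbols left to right
            h = self.step(x, h)
        return h              # fixed-length summary vector c

enc = GRUEncoder(input_dim=8, hidden_dim=16)
seq = [np.random.default_rng(1).standard_normal(8) for _ in range(5)]
c = enc.encode(seq)
print(c.shape)  # (16,)
```

In the paper, a decoder built from the same kind of gated unit then generates (or scores) the target sequence, conditioning every step on this summary vector c.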
4.1 Data and Baseline System
Large amounts of resources are available to build an English/French SMT system in the framework of the WMT’14 translation task.
- The authors reduced the training data by applying the data selection method proposed in (Moore and Lewis, 2010) and its extension to bitexts (Axelrod et al, 2011).
- By these means the authors selected a subset of 418M words out of more than 2 billion words for language modeling and a subset of 348M out of 850M words for training the RNN Encoder–Decoder.
- The development and test sets each contain more than 70 thousand words and a single reference translation.
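The (Moore and Lewis, 2010) selection cited above ranks each candidate sentence by the difference in per-word cross-entropy between an in-domain language model and an out-of-domain one, keeping the lowest-scoring (most in-domain) sentences; (Axelrod et al, 2011) applies the same idea to both sides of a bitext. A toy sketch, with add-one-smoothed unigram models standing in for the real n-gram LMs:

```python
import math
from collections import Counter

def unigram_lm(corpus, vocab):
    """Add-one-smoothed unigram probabilities over a fixed vocabulary."""
    counts = Counter(w for sent in corpus for w in sent.split())
    total = sum(counts.values())
    return {w: (counts[w] + 1) / (total + len(vocab)) for w in vocab}

def cross_entropy(sentence, lm):
    words = sentence.split()
    return -sum(math.log2(lm[w]) for w in words) / len(words)

# tiny hypothetical corpora, just to make the scoring concrete
in_domain = ["the translation model scores phrases", "the model scores the phrases"]
out_domain = ["stocks fell sharply on monday", "the market closed lower"]
pool = ["the model scores phrases well", "stocks closed lower on monday"]

vocab = {w for s in in_domain + out_domain + pool for w in s.split()}
lm_in, lm_out = unigram_lm(in_domain, vocab), unigram_lm(out_domain, vocab)

# Moore-Lewis score: H_in(s) - H_out(s); lower means more in-domain
scored = sorted(pool, key=lambda s: cross_entropy(s, lm_in) - cross_entropy(s, lm_out))
print(scored[0])  # the candidate that looks most in-domain
```

The cross-entropy difference cancels out generic-frequency effects, so a sentence is kept for looking *specifically* in-domain, not merely for being probable overall.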
- The proposed RNN Encoder–Decoder is able to either score a pair of sequences or generate a target sequence given a source sequence.
- Table1: BLEU scores computed on the development and test sets using different combinations of approaches. WP denotes a word penalty, where we penalize the number of words unknown to the neural networks.
- Table2: The top scoring target phrases for a small set of source phrases according to the translation model (direct translation probability) and by the RNN Encoder–Decoder. Source phrases were randomly selected from phrases with 4 or more words. "?" denotes an incomplete (partial) character, and "г" is the Cyrillic letter ghe.
- Table3: Samples generated from the RNN Encoder–Decoder for each source phrase used in Table 2. We show the top-5 target phrases out of 50 samples. They are sorted by the RNN Encoder–Decoder scores
- FB and HS were partially funded by the European Commission under the project MateCat, and by DARPA under the BOLT project
Study subjects and analysis
- Randomly selected phrase pairs per update: 64
We used Adadelta and stochastic gradient descent to train the RNN Encoder–Decoder with hyperparameters ε = 10⁻⁶ and ρ = 0.95 (Zeiler, 2012). At each update, we used 64 randomly selected phrase pairs from a phrase table (which was created from 348M words). The model was trained for approximately three days.
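Adadelta, as configured above with ρ = 0.95 and ε = 10⁻⁶, keeps decaying averages of squared gradients and squared updates, so no global learning rate has to be tuned. A self-contained sketch of the update rule, applied here to a toy quadratic rather than to the actual model:

```python
import numpy as np

def adadelta(grad_fn, theta, rho=0.95, eps=1e-6, steps=5000):
    """Adadelta (Zeiler, 2012): step size is the RMS of recent updates
    divided by the RMS of recent gradients, per parameter."""
    Eg2 = np.zeros_like(theta)    # decaying average of squared gradients
    Edx2 = np.zeros_like(theta)   # decaying average of squared updates
    for _ in range(steps):
        g = grad_fn(theta)
        Eg2 = rho * Eg2 + (1.0 - rho) * g * g
        dx = -np.sqrt(Edx2 + eps) / np.sqrt(Eg2 + eps) * g
        Edx2 = rho * Edx2 + (1.0 - rho) * dx * dx
        theta = theta + dx
    return theta

# Toy objective f(x) = ||x - 3||^2 with gradient 2(x - 3); both
# coordinates drift toward the minimum at [3, 3].
theta = adadelta(lambda x: 2.0 * (x - 3.0), np.array([0.0, 10.0]))
print(theta)
```

Note how ε serves double duty: it keeps the denominator nonzero and bootstraps the very first updates, when the accumulator of past updates is still empty.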
Furthermore, in Table 3, we show for each of the source phrases in Table 2 the generated samples from the RNN Encoder–Decoder. For each source phrase, we generated 50 samples and show the top-five phrases according to their scores. We can see that the RNN Encoder–Decoder is able to propose well-formed target phrases without looking at the actual phrase table.
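The 50-sample / top-5 protocol described above can be sketched as follows. The decoder here is a hypothetical stand-in (a toy unigram sampler and scorer), not the trained RNN Encoder–Decoder:

```python
import numpy as np

def sample_and_rank(sample_fn, score_fn, n_samples=50, top_k=5, seed=0):
    """Draw n_samples candidate target phrases, deduplicate them, and keep
    the top_k by model score, mirroring the 50-sample / top-5 protocol."""
    rng = np.random.default_rng(seed)
    candidates = {tuple(sample_fn(rng)) for _ in range(n_samples)}
    return sorted(candidates, key=score_fn, reverse=True)[:top_k]

# hypothetical stand-ins for the trained decoder: sample a short phrase
# from a toy vocabulary, score it by a fixed log-probability table
vocab = ["at", "the", "end", "of", "beginning"]
probs = [0.3, 0.3, 0.2, 0.15, 0.05]
logp = {w: np.log(p) for w, p in zip(vocab, probs)}

def toy_sample(rng):
    length = rng.integers(2, 5)                       # phrase of 2-4 words
    return list(rng.choice(vocab, size=length, p=probs))

def toy_score(phrase):                                # sum of word log-probs
    return sum(logp[w] for w in phrase)

best = sample_and_rank(toy_sample, toy_score)
for phrase in best:
    print(" ".join(phrase))
```

In the paper, the score would instead be the RNN Encoder–Decoder's log-probability of the sampled target phrase given the source phrase.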
- [Auli et al.2013] Michael Auli, Michel Galley, Chris Quirk, and Geoffrey Zweig. 2013. Joint language and translation modeling with recurrent neural networks. In Proceedings of the ACL Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 1044–1054.
- [Axelrod et al.2011] Amittai Axelrod, Xiaodong He, and Jianfeng Gao. 2011. Domain adaptation via pseudo in-domain data selection. In Proceedings of the ACL Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 355–362.
- [Bastien et al.2012] Frederic Bastien, Pascal Lamblin, Razvan Pascanu, James Bergstra, Ian J. Goodfellow, Arnaud Bergeron, Nicolas Bouchard, and Yoshua Bengio. 2012. Theano: new features and speed improvements. Deep Learning and Unsupervised Feature Learning NIPS 2012 Workshop.
- [Bengio et al.2003] Yoshua Bengio, Rejean Ducharme, Pascal Vincent, and Christian Janvin. 2003. A neural probabilistic language model. J. Mach. Learn. Res., 3:1137–1155, March.
- [Bengio et al.2013] Y. Bengio, N. Boulanger-Lewandowski, and R. Pascanu. 2013. Advances in optimizing recurrent networks. In Proceedings of the 38th International Conference on Acoustics, Speech, and Signal Processing (ICASSP 2013), May.
- [Bergstra et al.2010] James Bergstra, Olivier Breuleux, Frederic Bastien, Pascal Lamblin, Razvan Pascanu, Guillaume Desjardins, Joseph Turian, David Warde-Farley, and Yoshua Bengio. 2010. Theano: a CPU and GPU math expression compiler. In Proceedings of the Python for Scientific Computing Conference (SciPy), June. Oral Presentation.
- [Chandar et al.2014] Sarath Chandar, Stanislas Lauly, Hugo Larochelle, Mitesh Khapra, Balaraman Ravindran, Vikas Raykar, and Amrita Saha. 2014. An autoencoder approach to learning bilingual word representations. arXiv:1402.1454 [cs.CL], February.
- [Dahl et al.2012] George E. Dahl, Dong Yu, Li Deng, and Alex Acero. 2012. Context-dependent pretrained deep neural networks for large vocabulary speech recognition. IEEE Transactions on Audio, Speech, and Language Processing, 20(1):33–42.
- [Devlin et al.2014] Jacob Devlin, Rabih Zbib, Zhongqiang Huang, Thomas Lamar, Richard Schwartz, and John Makhoul. 2014. Fast and robust neural network joint models for statistical machine translation. In Proceedings of the ACL 2014 Conference, ACL ’14, pages 1370–1380.
- [Gao et al.2013] Jianfeng Gao, Xiaodong He, Wen tau Yih, and Li Deng. 2013. Learning semantic representations for the phrase translation model. Technical report, Microsoft Research.
- [Glorot et al.2011] X. Glorot, A. Bordes, and Y. Bengio. 2011. Deep sparse rectifier neural networks. In AISTATS’2011.
- [Goodfellow et al.2013] Ian J. Goodfellow, David Warde-Farley, Mehdi Mirza, Aaron Courville, and Yoshua Bengio. 2013. Maxout networks. In ICML’2013.
- [Graves2012] Alex Graves. 2012. Supervised Sequence Labelling with Recurrent Neural Networks. Studies in Computational Intelligence. Springer.
- [Hochreiter and Schmidhuber1997] S. Hochreiter and J. Schmidhuber. 1997. Long short-term memory. Neural Computation, 9(8):1735–1780.
- [Kalchbrenner and Blunsom2013] Nal Kalchbrenner and Phil Blunsom. 2013. Two recurrent continuous translation models. In Proceedings of the ACL Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 1700–1709.
- [Koehn et al.2003] Philipp Koehn, Franz Josef Och, and Daniel Marcu. 2003. Statistical phrase-based translation. In Proceedings of the 2003 Conference of the North American Chapter of the Association for Computational Linguistics on Human Language Technology - Volume 1, NAACL ’03, pages 48–54.
- [Koehn2005] P. Koehn. 2005. Europarl: A parallel corpus for statistical machine translation. In Machine Translation Summit X, pages 79–86, Phuket, Thailand.
- [Krizhevsky et al.2012] Alex Krizhevsky, Ilya Sutskever, and Geoffrey Hinton. 2012. ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems 25 (NIPS’2012).
- [Marcu and Wong2002] Daniel Marcu and William Wong. 2002. A phrase-based, joint probability model for statistical machine translation. In Proceedings of the ACL-02 Conference on Empirical Methods in Natural Language Processing - Volume 10, EMNLP ’02, pages 133–139.
- [Mikolov et al.2013] Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg Corrado, and Jeff Dean. 2013. Distributed representations of words and phrases and their compositionality. In Advances in Neural Information Processing Systems 26, pages 3111–3119.
- [Moore and Lewis2010] Robert C. Moore and William Lewis. 2010. Intelligent selection of language model training data. In Proceedings of the ACL 2010 Conference Short Papers, ACLShort ’10, pages 220–224, Stroudsburg, PA, USA.
- [Pascanu et al.2014] R. Pascanu, C. Gulcehre, K. Cho, and Y. Bengio. 2014. How to construct deep recurrent neural networks. In Proceedings of the Second International Conference on Learning Representations (ICLR 2014), April.
- [Saxe et al.2014] Andrew M. Saxe, James L. McClelland, and Surya Ganguli. 2014. Exact solutions to the nonlinear dynamics of learning in deep linear neural networks. In Proceedings of the Second International Conference on Learning Representations (ICLR 2014), April.
- [Schwenk et al.2006] Holger Schwenk, Marta R. Costa-Jussa, and Jose A. R. Fonollosa. 2006. Continuous space language models for the IWSLT 2006 task. In IWSLT, pages 166–173.
- [Schwenk2007] Holger Schwenk. 2007. Continuous space language models. Comput. Speech Lang., 21(3):492–518, July.
- [Schwenk2012] Holger Schwenk. 2012. Continuous space translation models for phrase-based statistical machine translation. In Martin Kay and Christian Boitet, editors, Proceedings of the 24th International Conference on Computational Linguistics (COLING), pages 1071–1080.
- [Socher et al.2011] Richard Socher, Eric H. Huang, Jeffrey Pennington, Andrew Y. Ng, and Christopher D. Manning. 2011. Dynamic pooling and unfolding recursive autoencoders for paraphrase detection. In Advances in Neural Information Processing Systems 24.
- [Son et al.2012] Le Hai Son, Alexandre Allauzen, and Francois Yvon. 2012. Continuous space translation models with neural networks. In Proceedings of the 2012 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, NAACL HLT ’12, pages 39–48, Stroudsburg, PA, USA.
- [van der Maaten2013] Laurens van der Maaten. 2013. Barnes-Hut-SNE. In Proceedings of the First International Conference on Learning Representations (ICLR 2013), May.
- [Vaswani et al.2013] Ashish Vaswani, Yinggong Zhao, Victoria Fossum, and David Chiang. 2013. Decoding with large-scale neural language models improves translation. Proceedings of the Conference on Empirical Methods in Natural Language Processing, pages 1387–1392.
- [Zeiler2012] Matthew D. Zeiler. 2012. ADADELTA: an adaptive learning rate method. Technical report, arXiv 1212.5701.
- [Zou et al.2013] Will Y. Zou, Richard Socher, Daniel M. Cer, and Christopher D. Manning. 2013. Bilingual word embeddings for phrase-based machine translation. In Proceedings of the ACL Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 1393–1398.