# Massive Exploration of Neural Machine Translation Architectures

Empirical Methods in Natural Language Processing, Volume abs/1703.03906, 2017.

Abstract:

Neural Machine Translation (NMT) has shown remarkable progress over the past few years with production systems now being deployed to end-users. One major drawback of current architectures is that they are expensive to train, typically requiring days to weeks of GPU time to converge. This makes exhaustive hyperparameter search, as is commo...

Introduction

- Neural Machine Translation (NMT) (Kalchbrenner and Blunsom, 2013; Sutskever et al, 2014; Cho et al, 2014) is an end-to-end approach to machine translation.
- The most popular approaches to NMT are based on sequence-to-sequence models, an encoder-decoder architecture consisting of two recurrent neural networks (RNNs) and an attention mechanism that aligns target with source tokens (Bahdanau et al, 2015; Luong et al, 2015a).
- The probability of each target token $y_i \in \{1, \dots, V\}$ is predicted based on the recurrent state in the decoder RNN $s_i$, the previous words $y_{<i}$, and a context vector $c_i$.
- The context vector $c_i$ is called the attention vector and is calculated as a weighted average of the source states.
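The weighted average behind the context vector can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: the dot-product scoring function, the helper names, and the toy vectors are all assumptions made for the example. An additive scoring function would only change how `scores` is computed.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def attention_context(decoder_state, source_states):
    """Return (weights, context) for a single decoder step.

    Dot-product scoring is an assumption made for illustration.
    """
    scores = [dot(decoder_state, h) for h in source_states]
    weights = softmax(scores)  # attention weights sum to 1
    dim = len(source_states[0])
    # Context vector: weighted average of the source states.
    context = [
        sum(w * h[k] for w, h in zip(weights, source_states))
        for k in range(dim)
    ]
    return weights, context


# Hypothetical toy values: three source states of dimension 2.
source_states = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
decoder_state = [0.5, 0.5]
weights, context = attention_context(decoder_state, source_states)
```

In this toy setup the third source state scores highest against the decoder state, so it receives the largest weight and dominates the context vector.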

Highlights

- Neural Machine Translation (NMT) (Kalchbrenner and Blunsom, 2013; Sutskever et al, 2014; Cho et al, 2014) is an end-to-end approach to machine translation
- The most popular approaches to Neural Machine Translation are based on sequence-to-sequence models, an encoder-decoder architecture consisting of two recurrent neural networks (RNNs) and an attention mechanism that aligns target with source tokens (Bahdanau et al, 2015; Luong et al, 2015a)
- One drawback of current Neural Machine Translation architectures is the huge amount of compute required to train them
- While sweeping across large hyperparameter spaces is common in Computer Vision (Huang et al, 2016b), such exploration would be prohibitively expensive for Neural Machine Translation models, limiting researchers to well-established architecture and hyperparameter choices
- Using a total of more than 250,000 GPU hours, we explore common variations of Neural Machine Translation architectures and provide insights into which architectural choices matter most
- We found that deep encoders are more difficult to optimize than decoders, that dense residual connections yield better performance than regular residual connections, and that a well-tuned beam search is surprisingly critical to obtaining state-of-the-art results

Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pages 1442–1451, Copenhagen, Denmark, September 7–11, 2017. © 2017 Association for Computational Linguistics.

Results

- For the sake of brevity, the authors only report mean BLEU, standard deviation, highest BLEU in parentheses, and model size in the following tables.
- While Table 1 shows that 2048-dimensional embeddings yielded the overall best result, they outperformed the smallest 128-dimensional embeddings only by a narrow yet statistically significant margin (p = 0.01), and took nearly twice as long to converge.
- Gradient updates to both small and large embeddings did not differ significantly from each other and the norm of gradient updates to the embedding matrix stayed approximately constant throughout training, regardless of size.
- It could be the case that models with large embeddings need far more than 2.5M steps to converge to the best solution
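The reporting format described above (mean BLEU, standard deviation, best score in parentheses) could be produced along these lines. This is a sketch only: the function name and the scores are hypothetical, and using the sample standard deviation is an assumption, since the summary does not say which estimator the authors used.

```python
import statistics


def summarize_bleu(scores):
    """Format BLEU scores from repeated runs as 'mean +/- std (best)'.

    Uses the sample standard deviation (an assumption; the source does
    not state which estimator was used).
    """
    mean = statistics.mean(scores)
    std = statistics.stdev(scores)
    best = max(scores)
    return f"{mean:.2f} +/- {std:.2f} ({best:.2f})"


# Hypothetical scores from three runs of one configuration.
runs = [21.0, 22.0, 23.0]
summary = summarize_bleu(runs)
```

Reporting the spread alongside the mean makes it possible to judge whether a gap between two configurations is larger than run-to-run noise.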

Conclusion

- The authors conducted a large-scale empirical analysis of architecture variations for Neural Machine Translation, teasing apart the key factors to achieving state-of-the-art results.
- Residual connections were necessary to train decoders with 8 layers and dense residual connections offer additional robustness
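The difference between regular and dense residual connections can be illustrated with a toy stack. This is a hedged sketch: `toy_layer` is a hypothetical stand-in for a recurrent layer at one time step, and summing (rather than concatenating) the skipped inputs is an assumption about the formulation, not something this summary confirms.

```python
def toy_layer(x, scale):
    """Hypothetical stand-in for one layer at a single time step."""
    return [scale * v for v in x]


def add(u, v):
    """Element-wise sum of two equal-length vectors."""
    return [a + b for a, b in zip(u, v)]


def stack_residual(x, scales):
    """Regular residual: each layer's output is added to its own input."""
    for s in scales:
        x = add(toy_layer(x, s), x)
    return x


def stack_dense_residual(x, scales):
    """Dense residual (assumed additive form): each layer's output is
    added to the inputs of this layer and of all earlier layers."""
    inputs = [x]
    for s in scales:
        nxt = toy_layer(inputs[-1], s)
        for prev in inputs:
            nxt = add(nxt, prev)
        inputs.append(nxt)
    return inputs[-1]


# Hypothetical one-dimensional input and two identical layers.
residual_out = stack_residual([1.0], [1.0, 1.0])
dense_out = stack_dense_residual([1.0], [1.0, 1.0])
```

In the dense variant every layer has a direct additive path back to the original input, which is one intuition for why such connections may ease optimization of deeper stacks.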


Tables

- Table 1: BLEU scores on newstest2013, varying the embedding dimensionality
- Table 2: BLEU scores on newstest2013, varying the type of encoder and decoder cell
- Table 3: BLEU scores on newstest2013, varying the encoder and decoder depth and type of residual connections
- Table 4: BLEU scores on newstest2013, varying the type of encoder. The "R" suffix indicates a reversed source sequence
- Table 5: BLEU scores on newstest2013, varying the type of attention mechanism
- Table 6: BLEU scores on newstest2013, varying the beam width and adding length penalties (LP)
- Table 7: Hyperparameter settings for our final combined model, consisting of all of the individually optimized values
- Table 8: Comparison to RNNSearch (Jean et al, 2015), RNNSearch-LV (Jean et al, 2015), BPE (Sennrich et al, 2016b), BPE-Char (Chung et al, 2016), Deep-Att (Zhou et al, 2016), Luong (Luong et al, 2015a), Deep-Conv (Gehring et al, 2016), GNMT (Wu et al, 2016), and OpenNMT (Klein et al, 2017). Systems with an * do not have a public implementation
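To make the length penalty (LP) in Table 6 concrete, the sketch below rescores finished beam hypotheses with a length-normalized score. It assumes the GNMT-style penalty `((5 + length) / 6) ** alpha` from Wu et al, 2016; whether this matches the exact variant behind Table 6 is an assumption, and the hypotheses and log-probabilities are made up for illustration. It covers only the final selection step, not the beam expansion itself.

```python
def length_penalty(length, alpha):
    """GNMT-style length penalty (assumed form); alpha = 0 disables it."""
    return ((5.0 + length) / 6.0) ** alpha


def best_hypothesis(hypotheses, alpha):
    """Pick the hypothesis with the highest length-normalized score.

    `hypotheses` is a list of (tokens, log_prob) pairs, where log_prob
    is the summed token log-probability of a finished beam entry.
    """
    return max(
        hypotheses,
        key=lambda h: h[1] / length_penalty(len(h[0]), alpha),
    )


# Hypothetical finished beam entries: a short and a longer translation.
short = (["a", "b"], -2.0)
longer = (["a", "b", "c", "d", "e", "f", "g"], -3.0)

without_lp = best_hypothesis([short, longer], alpha=0.0)
with_lp = best_hypothesis([short, longer], alpha=1.0)
```

Without the penalty the shorter hypothesis wins because it accumulates fewer negative log-probabilities. With `alpha=1.0` the longer one is preferred, which shows how the penalty counteracts the bias of beam search toward short outputs.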

Reference

- Martin Abadi, Paul Barham, Jianmin Chen, Zhifeng Chen, Andy Davis, Jeffrey Dean, Matthieu Devin, Sanjay Ghemawat, Geoffrey Irving, Michael Isard, Manjunath Kudlur, Josh Levenberg, Rajat Monga, Sherry Moore, Derek G. Murray, Benoit Steiner, Paul Tucker, Vijay Vasudevan, Pete Warden, Martin Wicke, Yuan Yu, and Xiaoqiang Zheng. 2016. TensorFlow: A system for large-scale machine learning. In OSDI.
- Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. 2015. Neural machine translation by jointly learning to align and translate. In ICLR.
- Kyunghyun Cho, Bart van Merrienboer, Caglar Gulcehre, Fethi Bougares, Holger Schwenk, and Yoshua Bengio. 2014. Learning phrase representations using RNN encoder-decoder for statistical machine translation. In EMNLP.
- Junyoung Chung, Kyunghyun Cho, and Yoshua Bengio. 2016. A character-level decoder without explicit segmentation for neural machine translation. In ACL.
- Jonas Gehring, Michael Auli, David Grangier, and Yann N. Dauphin. 2016. A convolutional encoder model for neural machine translation. CoRR abs/1611.02344.
- Klaus Greff, Rupesh Kumar Srivastava, Jan Koutnık, Bas R. Steunebrink, and Jurgen Schmidhuber. 201LSTM: A search space odyssey. IEEE Transactions on Neural Networks and Learning Systems PP(99):1–11.
- Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. 2016. Deep residual learning for image recognition. In CVPR.
- Sepp Hochreiter and Jurgen Schmidhuber. 1997. Long short-term memory. Neural Computation 9(8):1735–1780.
- Gao Huang, Zhuang Liu, and Kilian Q. Weinberger. 2016a. Densely connected convolutional networks. CoRR abs/1608.06993.
- Jonathan Huang, Vivek Rathod, Chen Sun, Menglong Zhu, Anoop Korattikara, Alireza Fathi, Ian Fischer, Zbigniew Wojna, Yang Song, Sergio Guadarrama, and Kevin Murphy. 2016b. Speed/accuracy trade-offs for modern convolutional object detectors. CoRR abs/1611.10012.
- Sebastien Jean, Kyunghyun Cho, Roland Memisevic, and Yoshua Bengio. 2015. On using very large target vocabulary for neural machine translation. In ACL.
- Nal Kalchbrenner and Phil Blunsom. 2013. Recurrent continuous translation models. In EMNLP.
- Andrej Karpathy and Li Fei-Fei. 2015. Deep visual-semantic alignments for generating image descriptions. In CVPR.
- Guillaume Klein, Yoon Kim, Yuntian Deng, Jean Senellart, and Alexander M. Rush. 2017. OpenNMT: Open-source toolkit for neural machine translation. CoRR abs/1701.02810.
- Jiwei Li, Michel Galley, Chris Brockett, Jianfeng Gao, and Bill Dolan. 20A diversity-promoting objective function for neural conversation models. arXiv preprint arXiv:1510.03055.
- Minh-Thang Luong and Christopher D. Manning. 2016. Achieving open vocabulary neural machine translation with hybrid word-character models. In ACL.
- Minh-Thang Luong, Hieu Pham, and Christopher D. Manning. 2015a. Effective approaches to attention-based neural machine translation. In EMNLP.
- Minh-Thang Luong, Ilya Sutskever, Quoc V. Le, Oriol Vinyals, and Wojciech Zaremba. 2015b. Addressing the rare word problem in neural machine translation. In ACL.
- Ramesh Nallapati, Bowen Zhou, Caglar Gulcehre, Bing Xiang, et al. 2016. Abstractive text summarization using sequence-to-sequence RNNs and beyond. arXiv preprint arXiv:1602.06023.
- Rico Sennrich, Barry Haddow, and Alexandra Birch. 2016a. Edinburgh neural machine translation systems for WMT 16. In ACL.
- Rico Sennrich, Barry Haddow, and Alexandra Birch. 2016b. Neural machine translation of rare words with subword units. In ACL.
- Lifeng Shang, Zhengdong Lu, and Hang Li. 2015. Neural responding machine for short-text conversation. arXiv preprint arXiv:1503.02364.
- Alessandro Sordoni, Michel Galley, Michael Auli, Chris Brockett, Yangfeng Ji, Margaret Mitchell, Jian-Yun Nie, Jianfeng Gao, and Bill Dolan. 2015. A neural network approach to context-sensitive generation of conversational responses. arXiv preprint arXiv:1506.06714.
- Rupesh Kumar Srivastava, Klaus Greff, and Jurgen Schmidhuber. 2015. Highway networks. arXiv preprint arXiv:1505.00387.
- Ilya Sutskever, Oriol Vinyals, and Quoc V. Le. 2014. Sequence to sequence learning with neural networks. In NIPS.
- Zhaopeng Tu, Yang Liu, Lifeng Shang, Xiaohua Liu, and Hang Li. 2017. Neural machine translation with reconstruction. In AAAI.
- Zhaopeng Tu, Zhengdong Lu, Yang Liu, Xiaohua Liu, and Hang Li. 2016. Modeling coverage for neural machine translation. In ACL.
- Oriol Vinyals and Quoc Le. 2015. A neural conversational model. arXiv preprint arXiv:1506.05869.
- Oriol Vinyals, Alexander Toshev, Samy Bengio, and Dumitru Erhan. 2015. Show and tell: A neural image caption generator. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition.
- Yonghui Wu, Mike Schuster, Zhifeng Chen, Quoc V. Le, Mohammad Norouzi, Wolfgang Macherey, Maxim Krikun, Yuan Cao, Qin Gao, Klaus Macherey, Jeff Klingner, Apurva Shah, Melvin Johnson, Xiaobing Liu, Lukasz Kaiser, Stephan Gouws, Yoshikiyo Kato, Taku Kudo, Hideto Kazawa, Keith Stevens, George Kurian, Nishant Patil, Wei Wang, Cliff Young, Jason Smith, Jason Riesa, Alex Rudnick, Oriol Vinyals, Greg Corrado, Macduff Hughes, and Jeffrey Dean. 2016. Google's neural machine translation system: Bridging the gap between human and machine translation. CoRR abs/1609.08144.
- Kelvin Xu, Jimmy Ba, Ryan Kiros, Kyunghyun Cho, Aaron C Courville, Ruslan Salakhutdinov, Richard S Zemel, and Yoshua Bengio. 2015. Show, attend and tell: Neural image caption generation with visual attention. In ICML.
- Jie Zhou, Ying Cao, Xuguang Wang, Peng Li, and Wei Xu. 2016. Deep recurrent models with fast-forward connections for neural machine translation. TACL.
