# From Credit Assignment to Entropy Regularization: Two New Algorithms for Neural Sequence Prediction

Proceedings of the Annual Meeting of the Association for Computational Linguistics (ACL), 2018, pages 1672–1682 (arXiv:1804.10974).


Keywords:

credit assignment, entropy regularization, recurrent neural networks, reinforcement learning, target distribution

Abstract:

In this work, we study the credit assignment problem in reward augmented maximum likelihood (RAML) learning, and establish a theoretical equivalence between the token-level counterpart of RAML and entropy-regularized reinforcement learning. Inspired by the connection, we propose two sequence prediction algorithms: one is the token-level extension of RAML, and the other a RAML-inspired improvement to the Actor-Critic algorithm.


Introduction

- Modeling and predicting discrete sequences is a central problem in many natural language processing tasks.
- Maximum likelihood estimation (MLE) can suffer from exposure bias: the model is never exposed to its own failures during training, and hence cannot recover from an error at test time.
- This issue stems from the difficulty of statistically modeling the exponentially large space of sequences, where most combinations cannot be covered by the observed data.
- Compared to algorithms that try to directly optimize the task metric, RAML avoids the difficulty of tracking and sampling from a constantly changing model distribution.
- Given an entropy-regularized MDP, for any fixed policy π, the state-value function Vπ(s) and the action-value function Qπ(s, a) are defined by Vπ(s) = E_{a∼π(·|s)}[Qπ(s, a) − τ log π(a|s)] and Qπ(s, a) = r(s, a) + γ E_{s′}[Vπ(s′)], where τ is the entropy temperature.
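The entropy-regularized value functions for a fixed policy can be made concrete with a small numerical sketch. The toy two-state MDP, its rewards, and the fixed policy below are illustrative assumptions (not from the paper); the loop simply iterates the standard soft Bellman equations for policy evaluation.

```python
import numpy as np

# Toy entropy-regularized MDP: 2 states, 2 actions, fixed policy pi.
# All numbers here are illustrative assumptions, not from the paper.
tau = 0.1          # entropy temperature
gamma = 0.9        # discount factor
P = np.array([[[0.8, 0.2], [0.1, 0.9]],    # P[s, a, s']: transition kernel
              [[0.5, 0.5], [0.3, 0.7]]])
r = np.array([[1.0, 0.0],                  # r[s, a]: per-step reward
              [0.0, 1.0]])
pi = np.array([[0.7, 0.3],                 # pi[s, a]: fixed stochastic policy
               [0.4, 0.6]])

# Iterate the soft Bellman equations until convergence:
#   Q(s, a) = r(s, a) + gamma * sum_s' P(s'|s,a) V(s')
#   V(s)    = sum_a pi(a|s) * (Q(s, a) - tau * log pi(a|s))
V = np.zeros(2)
for _ in range(1000):
    Q = r + gamma * (P @ V)
    V = (pi * (Q - tau * np.log(pi))).sum(axis=1)

print(V)  # converged entropy-regularized state values
```

The only difference from ordinary policy evaluation is the −τ log π(a|s) term, which rewards the policy for keeping its action distribution spread out.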

Highlights

- Modeling and predicting discrete sequences is a central problem in many natural language processing tasks.
- Maximum likelihood estimation can suffer from exposure bias: the model is never exposed to its own failures during training, and cannot recover from an error at test time.
- We propose two algorithms for neural sequence prediction: one is the token-level extension of reward augmented maximum likelihood (RAML), and the other a RAML-inspired improvement to the Actor-Critic algorithm (§4).
- In this work, motivated by the intriguing connection between token-level RAML and entropy-regularized reinforcement learning, we propose two algorithms for neural sequence prediction.
- Despite the distinct training procedures, both algorithms combine fine-grained credit assignment with entropy regularization, leading to positive empirical results.
- We believe the ground-truth reference contains sufficient information for such an oracle, and that the current bottleneck lies in the reinforcement learning algorithm.
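The RAML side of the connection can be sketched concretely: RAML draws training targets from an exponentiated-payoff distribution centered on the reference, rather than sampling from the changing model distribution. The candidate set and toy position-match reward below are illustrative assumptions, not the payoff used in the paper.

```python
import math
import random

# Hedged sketch of the RAML target distribution (Norouzi et al., 2016).
tau = 0.5                                   # payoff temperature
reference = "the cat sat"
candidates = ["the cat sat", "the cat sits", "a cat sat", "the dog sat"]

def reward(y, y_star):
    # Toy sentence reward: fraction of aligned tokens that match.
    yt, st = y.split(), y_star.split()
    return sum(a == b for a, b in zip(yt, st)) / max(len(yt), len(st))

# q(y | y*) ∝ exp(r(y, y*) / tau), normalized over the candidate set
weights = [math.exp(reward(y, reference) / tau) for y in candidates]
Z = sum(weights)
q = [w / Z for w in weights]

# During training one samples y ~ q and maximizes log p_model(y);
# here we just draw one sample to show the mechanism.
sample = random.choices(candidates, weights=q, k=1)[0]
```

Lowering τ concentrates q on the reference itself (recovering MLE); raising it spreads probability mass over near-miss sequences.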

Methods

- The authors focus on two sequence prediction tasks: machine translation and image captioning.
- Due to space limits, only the information necessary to compare the empirical results is presented here.
- For a more detailed description, readers are referred to Appendix B and the released code.
- Machine Translation: following Ranzato et al. (2015), the authors evaluate on the IWSLT 2014 German-to-English dataset (Mauro et al., 2012).
- The authors follow the pre-processing procedure of Ranzato et al. (2015).
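Translation quality in these experiments is reported in BLEU (Papineni et al., 2002). For orientation, here is a minimal single-reference sentence-BLEU sketch; the add-one smoothing is an illustrative simplification and this is not the evaluation script used by the authors.

```python
import math
from collections import Counter

def sentence_bleu(reference, hypothesis, max_n=4):
    """Simplified single-reference BLEU with add-one smoothing."""
    ref, hyp = reference.split(), hypothesis.split()
    log_prec = 0.0
    for n in range(1, max_n + 1):
        ref_ngrams = Counter(tuple(ref[i:i+n]) for i in range(len(ref) - n + 1))
        hyp_ngrams = Counter(tuple(hyp[i:i+n]) for i in range(len(hyp) - n + 1))
        overlap = sum((hyp_ngrams & ref_ngrams).values())  # clipped counts
        total = max(sum(hyp_ngrams.values()), 1)
        # add-one smoothing so one empty n-gram order does not zero the score
        log_prec += math.log((overlap + 1) / (total + 1)) / max_n
    # brevity penalty discourages overly short hypotheses
    bp = min(1.0, math.exp(1 - len(ref) / max(len(hyp), 1)))
    return bp * math.exp(log_prec)
```

A perfect match scores 1.0, and the brevity penalty pushes truncated hypotheses below that even when every emitted n-gram is correct.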

Conclusion

- In this work, motivated by the intriguing connection between token-level RAML and entropy-regularized RL, the authors propose two algorithms for neural sequence prediction.
- Despite the distinct training procedures, both algorithms combine fine-grained credit assignment with entropy regularization, leading to positive empirical results.
- The authors believe the ground-truth reference contains sufficient information for such an oracle, and that the current bottleneck lies in the RL algorithm.
- Given the numerous potential applications of such an oracle, the authors believe improving its accuracy is a promising future direction.

Tables

- Table1: Test results on two benchmark tasks. Bold faces highlight the best in the corresponding category
- Table2: Comparison with existing algorithms on the IWSLT 2014 dataset for MT. All numbers for previous algorithms are from the original works
- Table3: Average validation BLEU of ERAC. As a reference, the average BLEU is 28.1 for MLE. λvar = 0 means not using the smoothing technique. β = 1 means not using a target network. † indicates excluding extreme values due to divergence
- Table4: Comparing ERAC with the variant without considering future entropy

Related work

- Task Loss Optimization and Exposure Bias: apart from the previously introduced RAML, BSO, and Actor-Critic (§1), MIXER (Ranzato et al., 2015) also utilizes chunk-level signals, where the chunk length grows as training proceeds. In contrast, minimum risk training (Shen et al., 2015) directly optimizes sentence-level BLEU; as a result, it requires a large number (100) of samples per data point to work well. To address exposure bias, scheduled sampling (Bengio et al., 2015) adopts a curriculum learning strategy to bridge training and inference. Professor forcing (Lamb et al., 2016) introduces an adversarial training mechanism to encourage the dynamics of the model to be the same at training time and inference time. For image captioning, self-critical sequence training (SCST) (Rennie et al., 2016) extends the MIXER algorithm with an improved baseline based on the current model performance.
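The SCST baseline idea can be sketched in a few lines: the advantage of a sampled output is its reward minus the reward of the greedy decode from the same model. The one-step categorical "policy" and toy rewards below are illustrative assumptions, not a captioning model.

```python
import numpy as np

# Hedged sketch of the SCST baseline (Rennie et al., 2016),
# collapsed to a single decoding step for illustration.
rng = np.random.default_rng(0)
logits = np.array([0.2, 1.0, -0.5])        # toy one-step policy over 3 tokens
probs = np.exp(logits) / np.exp(logits).sum()

reward = np.array([0.1, 0.9, 0.3])         # toy task reward per token choice

sampled = rng.choice(3, p=probs)           # y^s ~ p_theta (stochastic sample)
greedy = int(np.argmax(probs))             # y_hat: greedy (test-time) decode

# SCST advantage: sampled reward minus greedy-decoding reward
advantage = reward[sampled] - reward[greedy]

# REINFORCE gradient of -advantage * log p(sampled) w.r.t. the logits
grad = -advantage * (np.eye(3)[sampled] - probs)
```

Using the model's own greedy decode as the baseline means samples that beat test-time decoding are reinforced and worse samples are suppressed, without training a separate value estimator.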

Reference

- Dzmitry Bahdanau, Philemon Brakel, Kelvin Xu, Anirudh Goyal, Ryan Lowe, Joelle Pineau, Aaron Courville, and Yoshua Bengio. 2016. An actor-critic algorithm for sequence prediction. arXiv preprint arXiv:1607.07086.
- Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. 2014. Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473.
- Leemon Baird. 1995. Residual algorithms: Reinforcement learning with function approximation. In Machine Learning Proceedings 1995, Elsevier, pages 30–37.
- Samy Bengio, Oriol Vinyals, Navdeep Jaitly, and Noam Shazeer. 2015. Scheduled sampling for sequence prediction with recurrent neural networks. In Advances in Neural Information Processing Systems. pages 1171–1179.
- William Chan, Navdeep Jaitly, Quoc Le, and Oriol Vinyals. 2016. Listen, attend and spell: A neural network for large vocabulary conversational speech recognition. In Acoustics, Speech and Signal Processing (ICASSP), 2016 IEEE International Conference on. IEEE, pages 4960–4964.
- Hal Daume III and Daniel Marcu. 2005. Learning as search optimization: Approximate large margin methods for structured prediction. In Proceedings of the 22nd international conference on Machine learning. ACM, pages 169–176.
- Tuomas Haarnoja, Haoran Tang, Pieter Abbeel, and Sergey Levine. 2017. Reinforcement learning with deep energy-based policies. arXiv preprint arXiv:1702.08165.
- Tuomas Haarnoja, Aurick Zhou, Pieter Abbeel, and Sergey Levine. 2018. Soft actor-critic: Off-policy maximum entropy deep reinforcement learning with a stochastic actor. arXiv preprint arXiv:1801.01290.
- Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. 2016. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition. pages 770– 778.
- Sepp Hochreiter and Jürgen Schmidhuber. 1997. Long short-term memory. Neural Computation 9(8):1735–1780.
- Po-Sen Huang, Chong Wang, Dengyong Zhou, and Li Deng. 2017. Toward neural phrase-based machine translation.
- Andrej Karpathy and Li Fei-Fei. 2015. Deep visual-semantic alignments for generating image descriptions. In Proceedings of the IEEE conference on computer vision and pattern recognition. pages 3128–3137.
- Alex M Lamb, Anirudh Goyal ALIAS PARTH GOYAL, Ying Zhang, Saizheng Zhang, Aaron C Courville, and Yoshua Bengio. 2016. Professor forcing: A new algorithm for training recurrent networks. In Advances In Neural Information Processing Systems. pages 4601–4609.
- Jiwei Li, Will Monroe, and Dan Jurafsky. 2017. Learning to decode for future success. arXiv preprint arXiv:1701.06549.
- Timothy P Lillicrap, Jonathan J Hunt, Alexander Pritzel, Nicolas Heess, Tom Erez, Yuval Tassa, David Silver, and Daan Wierstra. 2015. Continuous control with deep reinforcement learning. arXiv preprint arXiv:1509.02971.
- Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollar, and C Lawrence Zitnick. 2014. Microsoft coco: Common objects in context. In European conference on computer vision. Springer, pages 740–755.
- Minh-Thang Luong, Hieu Pham, and Christopher D Manning. 2015. Effective approaches to attention-based neural machine translation. arXiv preprint arXiv:1508.04025.
- Xuezhe Ma, Pengcheng Yin, Jingzhou Liu, Graham Neubig, and Eduard Hovy. 2017. Softmax q-distribution estimation for structured prediction: A theoretical interpretation for RAML. arXiv preprint arXiv:1705.07136.
- Cettolo Mauro, Girardi Christian, and Federico Marcello. 2012. WIT3: Web inventory of transcribed and translated talks. In Conference of European Association for Machine Translation. pages 261–268.
- Volodymyr Mnih, Adria Puigdomenech Badia, Mehdi Mirza, Alex Graves, Timothy Lillicrap, Tim Harley, David Silver, and Koray Kavukcuoglu. 2016. Asynchronous methods for deep reinforcement learning. In International Conference on Machine Learning. pages 1928–1937.
- Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Andrei A Rusu, Joel Veness, Marc G Bellemare, Alex Graves, Martin Riedmiller, Andreas K Fidjeland, Georg Ostrovski, et al. 2015. Human-level control through deep reinforcement learning. Nature 518(7540):529.
- Ofir Nachum, Mohammad Norouzi, Kelvin Xu, and Dale Schuurmans. 2017. Bridging the gap between value and policy based reinforcement learning. In Advances in Neural Information Processing Systems. pages 2772–2782.
- Mohammad Norouzi, Samy Bengio, Navdeep Jaitly, Mike Schuster, Yonghui Wu, Dale Schuurmans, et al. 2016. Reward augmented maximum likelihood for neural structured prediction. In Advances In Neural Information Processing Systems. pages 1723–1731.
- Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. 2002. Bleu: a method for automatic evaluation of machine translation. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics. Association for Computational Linguistics, pages 311–318.
- Marc’Aurelio Ranzato, Sumit Chopra, Michael Auli, and Wojciech Zaremba. 2015. Sequence level training with recurrent neural networks. arXiv preprint arXiv:1511.06732.
- Steven J Rennie, Etienne Marcheret, Youssef Mroueh, Jarret Ross, and Vaibhava Goel. 2016. Self-critical sequence training for image captioning. arXiv preprint arXiv:1612.00563.
- Alexander M Rush, Sumit Chopra, and Jason Weston. 2015. A neural attention model for abstractive sentence summarization. arXiv preprint arXiv:1509.00685.
- John Schulman, Pieter Abbeel, and Xi Chen. 2017. Equivalence between policy gradients and soft q-learning. arXiv preprint arXiv:1704.06440.
- Shiqi Shen, Yong Cheng, Zhongjun He, Wei He, Hua Wu, Maosong Sun, and Yang Liu. 2015. Minimum risk training for neural machine translation. arXiv preprint arXiv:1512.02433.
- Ilya Sutskever, Oriol Vinyals, and Quoc V Le. 2014. Sequence to sequence learning with neural networks. In Advances in neural information processing systems. pages 3104–3112.
- Oriol Vinyals, Alexander Toshev, Samy Bengio, and Dumitru Erhan. 2015. Show and tell: A neural image caption generator. In Computer Vision and Pattern Recognition (CVPR), 2015 IEEE Conference on. IEEE, pages 3156–3164.
- Ronald J Williams and Jing Peng. 1991. Function optimization using connectionist reinforcement learning algorithms. Connection Science 3(3):241–268.
- Sam Wiseman and Alexander M Rush. 2016. Sequence-to-sequence learning as beam-search optimization. arXiv preprint arXiv:1606.02960.
- Kelvin Xu, Jimmy Ba, Ryan Kiros, Kyunghyun Cho, Aaron Courville, Ruslan Salakhutdinov, Rich Zemel, and Yoshua Bengio. 2015. Show, attend and tell: Neural image caption generation with visual attention. In International Conference on Machine Learning.
- Brian D Ziebart. 2010. Modeling purposeful adaptive behavior with the principle of maximum causal entropy. Ph.D. thesis, Carnegie Mellon University.
- Brian D Ziebart, Andrew L Maas, J Andrew Bagnell, and Anind K Dey. 2008. Maximum entropy inverse reinforcement learning. In AAAI. Chicago, IL, USA, volume 8, pages 1433–1438.
