From Credit Assignment to Entropy Regularization: Two New Algorithms for Neural Sequence Prediction

Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (ACL), 2018, pages 1672–1682. arXiv:1804.10974.

Keywords: credit assignment, entropy regularization, recurrent neural networks, reinforcement learning, target distribution

Abstract:

In this work, we study the credit assignment problem in reward augmented maximum likelihood (RAML) learning, and establish a theoretical equivalence between the token-level counterpart of RAML and entropy-regularized reinforcement learning. Inspired by the connection, we propose two sequence prediction algorithms, one extending RAML with fine-grained, token-level credit assignment, and the other a RAML-inspired improvement to the actor-critic algorithm based on entropy regularization.

Introduction
  • Modeling and predicting discrete sequences is a central problem in many natural language processing tasks.
  • MLE can suffer from exposure bias: the model is never exposed to its own mistakes during training and therefore cannot recover from an error at test time.
  • This issue stems from the difficulty of statistically modeling the exponentially large space of sequences, most of which is never covered by the observed data.
  • Compared to algorithms that directly optimize the task metric, RAML avoids the difficulty of tracking and sampling from a constantly changing model distribution.
  • Given an entropy-regularized MDP, for any fixed policy π, the state-value function V^π(s) and the action-value function Q^π(s, a) can be defined as shown below.
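Following the standard entropy-regularized RL formulation (e.g., Nachum et al., 2017), with τ the entropy temperature, γ the discount factor, and H the Shannon entropy; the notation here is a standard choice and not necessarily the paper's exact symbols:

    V^{\pi}(s) = \mathbb{E}_{a \sim \pi(\cdot \mid s)}\left[ Q^{\pi}(s, a) \right] + \tau \, \mathcal{H}\left( \pi(\cdot \mid s) \right)
    Q^{\pi}(s, a) = r(s, a) + \gamma \, \mathbb{E}_{s' \sim P(\cdot \mid s, a)}\left[ V^{\pi}(s') \right]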
Highlights
  • Modeling and predicting discrete sequences is a central problem in many natural language processing tasks.
  • Maximum likelihood estimation can suffer from exposure bias: the model is never exposed to its own mistakes during training and therefore cannot recover from an error at test time.
  • We propose two algorithms for neural sequence prediction, where one is a token-level extension of reward augmented maximum likelihood (RAML), and the other a RAML-inspired improvement to the actor-critic (AC) algorithm (§4); the standard RAML objective is sketched after this list.
  • In this work, motivated by the intriguing connection between token-level RAML and entropy-regularized reinforcement learning, we propose two algorithms for neural sequence prediction.
  • Despite their distinct training procedures, both algorithms combine fine-grained credit assignment with entropy regularization, leading to positive empirical results.
  • We believe the ground-truth reference contains sufficient information for such an oracle, and that the current bottleneck lies in the reinforcement learning algorithm.
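For reference, a sketch of the sentence-level RAML objective introduced by Norouzi et al. (2016), which the first algorithm extends to the token level; r(y, y*) is the task reward (e.g., BLEU) and τ the temperature, written here in our own notation:

    \mathcal{L}_{\mathrm{RAML}}(\theta) = - \sum_{(x, y^{*}) \in \mathcal{D}} \sum_{y \in \mathcal{Y}} q(y \mid y^{*}; \tau) \, \log p_{\theta}(y \mid x)
    q(y \mid y^{*}; \tau) = \frac{\exp\left( r(y, y^{*}) / \tau \right)}{\sum_{y'} \exp\left( r(y', y^{*}) / \tau \right)}

Because q depends only on the fixed reference y*, training samples can be drawn without consulting the model, which is why RAML avoids tracking the changing model distribution.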
Methods
  • The authors focus on two sequence prediction tasks: machine translation and image captioning.
  • Due to the space limit, the authors present only the information necessary to compare the empirical results.
  • For a more detailed description, the authors refer readers to Appendix B and the released code.
  • Machine Translation: Following Ranzato et al. (2015), the authors evaluate on the IWSLT 2014 German-to-English dataset (Cettolo et al., 2012).
  • The authors follow the pre-processing procedure of Ranzato et al. (2015) and report BLEU (Papineni et al., 2002) as the evaluation metric; a minimal BLEU computation sketch follows this list.
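A minimal sketch of corpus-level BLEU computation using NLTK; the paper's exact evaluation script may differ (e.g., the multi-bleu.perl script commonly used with the Ranzato et al. pre-processing), and the file names below are hypothetical:

    # Corpus-level BLEU with NLTK (illustrative only; file names are hypothetical).
    from nltk.translate.bleu_score import corpus_bleu, SmoothingFunction

    def corpus_bleu_score(hyp_path, ref_path):
        """BLEU for whitespace-tokenized, line-aligned hypothesis/reference files."""
        with open(hyp_path) as f:
            hyps = [line.split() for line in f]       # one tokenized hypothesis per line
        with open(ref_path) as f:
            refs = [[line.split()] for line in f]     # one reference list per hypothesis
        return corpus_bleu(refs, hyps,
                           smoothing_function=SmoothingFunction().method1)

    # Example: print(corpus_bleu_score("hyps.de-en.en", "refs.de-en.en"))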
Conclusion
  • In this work, motivated by the intriguing connection between token-level RAML and entropy-regularized RL, the authors propose two algorithms for neural sequence prediction.
  • Despite their distinct training procedures, both algorithms combine fine-grained credit assignment with entropy regularization, leading to positive empirical results.
  • The authors believe the ground-truth reference contains sufficient information for such an oracle, and that the current bottleneck lies in the RL algorithm.
  • Given the numerous potential applications of such an oracle, the authors believe improving its accuracy will be a promising future direction.
Tables
  • Table 1: Test results on the two benchmark tasks. Bold face highlights the best result in the corresponding category.
  • Table 2: Comparison with existing algorithms on the IWSLT 2014 dataset for MT. All numbers for previous algorithms are taken from the original works.
  • Table 3: Average validation BLEU of ERAC. As a reference, the average BLEU for MLE is 28.1. λ_var = 0 means the smoothing technique is not used; β = 1 means no target network is used. † indicates that extreme values due to divergence are excluded.
  • Table 4: Comparison of ERAC with the variant that does not consider future entropy.
Related work
  • Task Loss Optimization and Exposure Bias: Apart from the previously introduced RAML, BSO, and Actor-Critic (§1), MIXER (Ranzato et al., 2015) also utilizes chunk-level signals, where the length of the chunk grows as training proceeds. In contrast, minimum risk training (Shen et al., 2015) directly optimizes sentence-level BLEU; as a result, it requires a large number (100) of samples per data point to work well. To address exposure bias, scheduled sampling (Bengio et al., 2015) adopts a curriculum learning strategy to bridge training and inference. Professor forcing (Lamb et al., 2016) introduces an adversarial training mechanism to encourage the dynamics of the model to be the same at training time and inference time. For image captioning, self-critical sequence training (SCST) (Rennie et al., 2016) extends the MIXER algorithm with an improved baseline based on the current model's own performance (a minimal sketch of this baseline is given below).
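To make the SCST baseline concrete, a toy, self-contained sketch of the self-critical policy-gradient loss for a single-step categorical policy; PyTorch is assumed, and the policy and reward function below are illustrative stand-ins, not the actual captioning model:

    import torch

    def scst_loss(logits, reward_fn):
        """Self-critical (SCST-style) loss for a toy one-step categorical policy.

        logits: unnormalized scores over a small vocabulary, shape (vocab_size,)
        reward_fn: maps a token id to a scalar task reward (stand-in for BLEU/CIDEr)
        """
        dist = torch.distributions.Categorical(logits=logits)
        sampled = dist.sample()                  # exploration sample
        greedy = torch.argmax(logits)            # greedy (test-time) decode as baseline
        advantage = reward_fn(sampled.item()) - reward_fn(greedy.item())
        # REINFORCE with the self-critical baseline: raise the log-probability of
        # samples that beat the current greedy decode, lower those that do worse.
        return -advantage * dist.log_prob(sampled)

    # Toy usage: vocabulary of 5 tokens, reward 1.0 only for token 3.
    logits = torch.zeros(5, requires_grad=True)
    loss = scst_loss(logits, lambda tok: 1.0 if tok == 3 else 0.0)
    loss.backward()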
References
  • Dzmitry Bahdanau, Philemon Brakel, Kelvin Xu, Anirudh Goyal, Ryan Lowe, Joelle Pineau, Aaron Courville, and Yoshua Bengio. 2016. An actor-critic algorithm for sequence prediction. arXiv preprint arXiv:1607.07086.
  • Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. 2014. Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473.
  • Leemon Baird. 1995. Residual algorithms: Reinforcement learning with function approximation. In Machine Learning Proceedings 1995, Elsevier, pages 30–37.
  • Samy Bengio, Oriol Vinyals, Navdeep Jaitly, and Noam Shazeer. 2015. Scheduled sampling for sequence prediction with recurrent neural networks. In Advances in Neural Information Processing Systems, pages 1171–1179.
  • William Chan, Navdeep Jaitly, Quoc Le, and Oriol Vinyals. 2016. Listen, attend and spell: A neural network for large vocabulary conversational speech recognition. In 2016 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 4960–4964.
  • Hal Daumé III and Daniel Marcu. 2005. Learning as search optimization: Approximate large margin methods for structured prediction. In Proceedings of the 22nd International Conference on Machine Learning, ACM, pages 169–176.
  • Tuomas Haarnoja, Haoran Tang, Pieter Abbeel, and Sergey Levine. 2017. Reinforcement learning with deep energy-based policies. arXiv preprint arXiv:1702.08165.
  • Tuomas Haarnoja, Aurick Zhou, Pieter Abbeel, and Sergey Levine. 2018. Soft actor-critic: Off-policy maximum entropy deep reinforcement learning with a stochastic actor. arXiv preprint arXiv:1801.01290.
  • Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. 2016. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 770–778.
  • Sepp Hochreiter and Jürgen Schmidhuber. 1997. Long short-term memory. Neural Computation 9(8):1735–1780.
  • Po-Sen Huang, Chong Wang, Dengyong Zhou, and Li Deng. 2017. Toward neural phrase-based machine translation.
  • Andrej Karpathy and Li Fei-Fei. 2015. Deep visual-semantic alignments for generating image descriptions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3128–3137.
  • Alex M. Lamb, Anirudh Goyal, Ying Zhang, Saizheng Zhang, Aaron C. Courville, and Yoshua Bengio. 2016. Professor forcing: A new algorithm for training recurrent networks. In Advances in Neural Information Processing Systems, pages 4601–4609.
  • Jiwei Li, Will Monroe, and Dan Jurafsky. 2017. Learning to decode for future success. arXiv preprint arXiv:1701.06549.
  • Timothy P. Lillicrap, Jonathan J. Hunt, Alexander Pritzel, Nicolas Heess, Tom Erez, Yuval Tassa, David Silver, and Daan Wierstra. 2015. Continuous control with deep reinforcement learning. arXiv preprint arXiv:1509.02971.
  • Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C. Lawrence Zitnick. 2014. Microsoft COCO: Common objects in context. In European Conference on Computer Vision, Springer, pages 740–755.
  • Minh-Thang Luong, Hieu Pham, and Christopher D. Manning. 2015. Effective approaches to attention-based neural machine translation. arXiv preprint arXiv:1508.04025.
  • Xuezhe Ma, Pengcheng Yin, Jingzhou Liu, Graham Neubig, and Eduard Hovy. 2017. Softmax Q-distribution estimation for structured prediction: A theoretical interpretation for RAML. arXiv preprint arXiv:1705.07136.
  • Mauro Cettolo, Christian Girardi, and Marcello Federico. 2012. WIT3: Web inventory of transcribed and translated talks. In Conference of the European Association for Machine Translation, pages 261–268.
  • Volodymyr Mnih, Adria Puigdomenech Badia, Mehdi Mirza, Alex Graves, Timothy Lillicrap, Tim Harley, David Silver, and Koray Kavukcuoglu. 2016. Asynchronous methods for deep reinforcement learning. In International Conference on Machine Learning, pages 1928–1937.
  • Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Andrei A. Rusu, Joel Veness, Marc G. Bellemare, Alex Graves, Martin Riedmiller, Andreas K. Fidjeland, Georg Ostrovski, et al. 2015. Human-level control through deep reinforcement learning. Nature 518(7540):529–533.
  • Ofir Nachum, Mohammad Norouzi, Kelvin Xu, and Dale Schuurmans. 2017. Bridging the gap between value and policy based reinforcement learning. In Advances in Neural Information Processing Systems, pages 2772–2782.
  • Mohammad Norouzi, Samy Bengio, Navdeep Jaitly, Mike Schuster, Yonghui Wu, Dale Schuurmans, et al. 2016. Reward augmented maximum likelihood for neural structured prediction. In Advances in Neural Information Processing Systems, pages 1723–1731.
  • Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. 2002. BLEU: A method for automatic evaluation of machine translation. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, pages 311–318.
  • Marc'Aurelio Ranzato, Sumit Chopra, Michael Auli, and Wojciech Zaremba. 2015. Sequence level training with recurrent neural networks. arXiv preprint arXiv:1511.06732.
  • Steven J. Rennie, Etienne Marcheret, Youssef Mroueh, Jarret Ross, and Vaibhava Goel. 2016. Self-critical sequence training for image captioning. arXiv preprint arXiv:1612.00563.
  • Alexander M. Rush, Sumit Chopra, and Jason Weston. 2015. A neural attention model for abstractive sentence summarization. arXiv preprint arXiv:1509.00685.
  • John Schulman, Pieter Abbeel, and Xi Chen. 2017. Equivalence between policy gradients and soft Q-learning. arXiv preprint arXiv:1704.06440.
  • Shiqi Shen, Yong Cheng, Zhongjun He, Wei He, Hua Wu, Maosong Sun, and Yang Liu. 2015. Minimum risk training for neural machine translation. arXiv preprint arXiv:1512.02433.
  • Ilya Sutskever, Oriol Vinyals, and Quoc V. Le. 2014. Sequence to sequence learning with neural networks. In Advances in Neural Information Processing Systems, pages 3104–3112.
  • Oriol Vinyals, Alexander Toshev, Samy Bengio, and Dumitru Erhan. 2015. Show and tell: A neural image caption generator. In 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 3156–3164.
  • Ronald J. Williams and Jing Peng. 1991. Function optimization using connectionist reinforcement learning algorithms. Connection Science 3(3):241–268.
  • Sam Wiseman and Alexander M. Rush. 2016. Sequence-to-sequence learning as beam-search optimization. arXiv preprint arXiv:1606.02960.
  • Kelvin Xu, Jimmy Ba, Ryan Kiros, Kyunghyun Cho, Aaron Courville, Ruslan Salakhutdinov, Rich Zemel, and Yoshua Bengio. 2015. Show, attend and tell: Neural image caption generation with visual attention. In International Conference on Machine Learning, pages 2048–2057.
  • Brian D. Ziebart. 2010. Modeling purposeful adaptive behavior with the principle of maximum causal entropy. Ph.D. thesis, Carnegie Mellon University.
  • Brian D. Ziebart, Andrew L. Maas, J. Andrew Bagnell, and Anind K. Dey. 2008. Maximum entropy inverse reinforcement learning. In AAAI, volume 8, pages 1433–1438.