Leverage the Average: an Analysis of KL Regularization in Reinforcement Learning

Nino Vieillard
Tadashi Kozuno

NeurIPS 2020.

Keywords: Momentum Value Iteration, dynamic programming, Maximum a Posteriori Policy Optimization, Soft Actor Critic, and more (15+)

Abstract:

Recent Reinforcement Learning (RL) algorithms making use of Kullback-Leibler (KL) regularization as a core component have shown outstanding performance. Yet, so far, little is understood theoretically about why KL regularization helps. We study KL regularization within an approximate value iteration scheme and show that it implicitly averages q-values. [...]

Introduction
  • In Reinforcement Learning (RL), Kullback-Leibler (KL) regularization consists in penalizing a new policy for straying too far from the previous one, as measured by the KL divergence (see the sketch after this list).
  • Geist et al. [20] have analyzed algorithms operating in the larger scope of regularization by Bregman divergences.
  • They concluded that regularization does not harm convergence, rate of convergence, or propagation of errors, but these results are no better than the corresponding ones in unregularized approximate dynamic programming (ADP).
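As a reminder of the object under study, the KL-regularized greedy step has a closed form that makes this averaging explicit. A schematic sketch (with $q_k$ the current q-value estimate, $\lambda$ the regularization weight, and $\pi_0$ assumed uniform):

    $\pi_{k+1}(\cdot|s) = \operatorname{argmax}_{\pi} \, \langle \pi, q_k(s,\cdot) \rangle - \lambda \, \mathrm{KL}(\pi \,\|\, \pi_k(\cdot|s)) \;\;\Rightarrow\;\; \pi_{k+1}(\cdot|s) \propto \pi_k(\cdot|s)\, e^{q_k(s,\cdot)/\lambda} \propto \exp\Big( \tfrac{1}{\lambda} \sum_{j=0}^{k} q_j(s,\cdot) \Big)$

The greedy policy is thus a softmax of the scaled sum (equivalently, the average) of all past q-value estimates, so estimation errors that are not systematically biased tend to cancel out; this is the implicit averaging referred to throughout.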
Highlights
  • In Reinforcement Learning (RL), Kullback-Leibler (KL) regularization consists in penalizing a new policy for straying too far from the previous one, as measured by the KL divergence
  • Thm. 1 is the first result showing that an RL algorithm can benefit from both a linear dependency on the horizon and an averaging of the errors, and we argue that this explains, at least partially, the beneficial effect of using KL regularization (a schematic comparison is sketched after this list)
  • We provided an explanation of the effect of KL regularization in RL, through the implicit averaging of q-values
  • We provided a very strong performance bound for KL regularization, the very first RL bound showing both a linear dependency on the horizon and an averaging of the estimation errors
  • The introduced abstract framework encompasses a number of existing approaches, but some assumptions we made do not hold when neural networks are used
  • Our result significantly improves previous analyses
  • The resulting approach, called “Munchausen Reinforcement Learning”, is simple and general, and provides agents outperforming the state of the art
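Schematically, and omitting the exact constants of Thm. 1, the contrast with unregularized approximate value iteration (per-iteration errors $\epsilon_j$, discount $\gamma$) is the following; this is an illustrative sketch, not the exact statements:

    Unregularized AVI: $\|q_* - q_{\pi_k}\|_\infty \lesssim \frac{2\gamma}{(1-\gamma)^2} \max_{j \le k} \|\epsilon_j\|_\infty + O\big(\tfrac{\gamma^k}{1-\gamma}\big)$
    KL-regularized VI: $\|q_* - q_{\pi_k}\|_\infty \lesssim \frac{2}{1-\gamma} \Big\| \frac{1}{k} \sum_{j=1}^{k} \epsilon_j \Big\|_\infty + O\big(\tfrac{1}{(1-\gamma)k}\big)$

The horizon factor $\frac{1}{1-\gamma}$ enters linearly instead of quadratically, and the errors appear through the norm of their average rather than through a maximum (or discounted sum) of their norms, so errors that do not share a systematic bias can compensate each other.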
Results
  • The authors' result significantly improves previous analyses
Conclusion
  • The authors provided an explanation of the effect of KL regularization in RL, through the implicit averaging of q-values.
  • The authors complemented the thorough theoretical analysis with an extensive empirical study
  • It confirms that KL regularization is helpful, and that regularizing the evaluation step is never detrimental.
  • The resulting approach, called “Munchausen Reinforcement Learning”, is simple and general, and provides agents outperforming the state of the art.
  • Thanks to the reparameterization used in Munchausen RL, there is no error in its greedy step and the bounds apply readily (a schematic form is sketched after this list).
  • More details can be found in [42]
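For reference, the reparameterization alluded to above is that of Munchausen RL [42]: the KL term is folded into the regression target through a scaled log-policy, and the policy itself is obtained in closed form from the q-estimate. A schematic form of the resulting target, with $\alpha$ the Munchausen scaling and $\tau$ the temperature (the exact version in [42] also clips the log-policy term):

    $q_{k+1}(s,a) \approx r(s,a) + \alpha \tau \ln \pi_k(a|s) + \gamma \, \mathbb{E}_{s'|s,a}\Big[ \sum_{a'} \pi_k(a'|s') \big( q_k(s',a') - \tau \ln \pi_k(a'|s') \big) \Big], \qquad \pi_k = \mathrm{softmax}(q_k/\tau)$

Because $\pi_k$ is computed exactly as a softmax of $q_k$, no approximation error enters the greedy step, which is why the bounds of the analysis apply readily.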
Summary
  • Introduction:

    In Reinforcement Learning (RL), Kullback-Leibler (KL) regularization consists in penalizing a new policy for straying too far from the previous one, as measured by the KL divergence.
  • Geist et al. [20] have analyzed algorithms operating in the larger scope of regularization by Bregman divergences.
  • They concluded that regularization does not harm convergence, rate of convergence, or propagation of errors, but these results are no better than the corresponding ones in unregularized approximate dynamic programming (ADP).
  • Objectives:

    The authors' goal is to study the core effect of regularization, especially of KL regularization, in a deep RL context.
  • Results:

    The authors' result significantly improves previous analyses
  • Conclusion:

    The authors provided an explanation of the effect of KL regularization in RL, through the implicit averaging of q-values.
  • The authors complemented the thorough theoretical analysis with an extensive empirical study
  • It confirms that KL regularization is helpful, and that regularizing the evaluation step is never detrimental.
  • The resulting approach, called “Munchausen Reinforcement Learning”, is simple and general, and provides agents outperforming the state of the art.
  • Thanks to the reparameterization used in Munchausen RL, there is no error in its greedy step and the bounds apply readily.
  • More details can be found in [42]
Tables
  • Table 1: Algorithms encompassed by MD/DA-MPI (in italic if new compared to [20])
References
  • Abbasi-Yadkori, Y., Bartlett, P., Bhatia, K., Lazic, N., Szepesvári, C., and Weisz, G. Politex: Regret bounds for policy iteration using expert prediction. In International Conference on Machine Learning (ICML), 2019.
  • Abdolmaleki, A., Springenberg, J. T., Tassa, Y., Munos, R., Heess, N., and Riedmiller, M. Maximum a posteriori policy optimisation. In International Conference on Learning Representations (ICLR), 2018.
  • Ahmed, Z., Le Roux, N., Norouzi, M., and Schuurmans, D. Understanding the impact of entropy on policy optimization. In International Conference on Machine Learning (ICML), 2019.
  • Archibald, T., McKinnon, K., and Thomas, L. On the generation of Markov decision processes. Journal of the Operational Research Society, 46(3):354–361, 1995.
  • Asadi, K. and Littman, M. L. An alternative softmax operator for reinforcement learning. In International Conference on Machine Learning (ICML), 2017.
  • Azar, M. G., Munos, R., Ghavamzadeh, M., and Kappen, H. J. Speedy Q-learning. In Advances in Neural Information Processing Systems (NeurIPS), 2011.
  • Azar, M. G., Gómez, V., and Kappen, H. J. Dynamic policy programming. Journal of Machine Learning Research (JMLR), 13(Nov):3207–3245, 2012.
  • Bagnell, J. A., Kakade, S. M., Schneider, J. G., and Ng, A. Y. Policy search by dynamic programming. In Advances in Neural Information Processing Systems, pp. 831–838, 2004.
  • Baird III, L. C. Reinforcement Learning Through Gradient Descent. PhD thesis, US Air Force Academy, US, 1999.
  • Bellemare, M. G., Naddaf, Y., Veness, J., and Bowling, M. The arcade learning environment: An evaluation platform for general agents. Journal of Artificial Intelligence Research, 47:253–279, 2013.
  • Bellemare, M. G., Ostrovski, G., Guez, A., Thomas, P. S., and Munos, R. Increasing the action gap: New operators for reinforcement learning. In AAAI Conference on Artificial Intelligence (AAAI), 2016.
  • Boyd, S. and Vandenberghe, L. Convex Optimization. Cambridge University Press, 2004.
  • Bradtke, S. J. and Barto, A. G. Linear least-squares algorithms for temporal difference learning. Machine Learning, 22(1-3):33–57, 1996.
  • Brockman, G., Cheung, V., Pettersson, L., Schneider, J., Schulman, J., Tang, J., and Zaremba, W. OpenAI Gym. arXiv preprint arXiv:1606.01540, 2016.
  • Castro, P. S., Moitra, S., Gelada, C., Kumar, S., and Bellemare, M. G. Dopamine: A research framework for deep reinforcement learning. arXiv preprint arXiv:1812.06110, 2018.
  • Fellows, M., Mahajan, A., Rudner, T. G., and Whiteson, S. VIREL: A variational inference framework for reinforcement learning. In Advances in Neural Information Processing Systems (NeurIPS), pp. 7122–7136, 2019.
  • Fox, R., Pakman, A., and Tishby, N. Taming the noise in reinforcement learning via soft updates. In Conference on Uncertainty in Artificial Intelligence (UAI), 2016.
  • Fujimoto, S., Hoof, H., and Meger, D. Addressing function approximation error in actor-critic methods. In International Conference on Machine Learning (ICML), pp. 1587–1596, 2018.
  • Geist, M., Piot, B., and Pietquin, O. Is the Bellman residual a bad proxy? In Advances in Neural Information Processing Systems, pp. 3205–3214, 2017.
  • Geist, M., Scherrer, B., and Pietquin, O. A theory of regularized Markov decision processes. In International Conference on Machine Learning (ICML), 2019.
  • Haarnoja, T., Tang, H., Abbeel, P., and Levine, S. Reinforcement learning with deep energy-based policies. In International Conference on Machine Learning (ICML), 2017.
  • Haarnoja, T., Zhou, A., Abbeel, P., and Levine, S. Soft actor-critic: Off-policy maximum entropy deep reinforcement learning with a stochastic actor. In International Conference on Machine Learning (ICML), 2018.
  • Hiriart-Urruty, J.-B. and Lemaréchal, C. Fundamentals of Convex Analysis. Springer Science & Business Media, 2012.
  • Kakade, S. and Langford, J. Approximately optimal approximate reinforcement learning. In International Conference on Machine Learning (ICML), 2002.
  • Kozuno, T., Uchibe, E., and Doya, K. Theoretical analysis of efficiency and robustness of softmax and gap-increasing operators in reinforcement learning. In International Conference on Artificial Intelligence and Statistics (AISTATS), 2019.
  • Levine, S. Reinforcement learning and control as probabilistic inference: Tutorial and review. arXiv preprint arXiv:1805.00909, 2018.
  • Mnih, V., Kavukcuoglu, K., Silver, D., Rusu, A. A., Veness, J., Bellemare, M. G., Graves, A., Riedmiller, M., Fidjeland, A. K., Ostrovski, G., et al. Human-level control through deep reinforcement learning. Nature, 518(7540):529, 2015.
  • Mnih, V., Badia, A. P., Mirza, M., Graves, A., Lillicrap, T., Harley, T., Silver, D., and Kavukcuoglu, K. Asynchronous methods for deep reinforcement learning. In International Conference on Machine Learning (ICML), 2016.
  • Munos, R., Stepleton, T., Harutyunyan, A., and Bellemare, M. Safe and efficient off-policy reinforcement learning. In Advances in Neural Information Processing Systems (NeurIPS), 2016.
  • Perolat, J., Scherrer, B., Piot, B., and Pietquin, O. Approximate dynamic programming for two-player zero-sum Markov games. In International Conference on Machine Learning (ICML), 2015.
  • Pérolat, J., Piot, B., Geist, M., Scherrer, B., and Pietquin, O. Softened approximate policy iteration for Markov games. In International Conference on Machine Learning (ICML), 2016.
  • Piot, B., Geist, M., and Pietquin, O. Difference of convex functions programming for reinforcement learning. In Advances in Neural Information Processing Systems, pp. 2519–2527, 2014.
  • Puterman, M. L. Markov Decision Processes: Discrete Stochastic Dynamic Programming. John Wiley & Sons, 2014.
  • Puterman, M. L. and Shin, M. C. Modified policy iteration algorithms for discounted Markov decision problems. Management Science, 24(11):1127–1137, 1978.
  • Scherrer, B. and Lesner, B. On the use of non-stationary policies for stationary infinite-horizon Markov decision processes. In Advances in Neural Information Processing Systems (NeurIPS), 2012.
  • Scherrer, B., Ghavamzadeh, M., Gabillon, V., Lesner, B., and Geist, M. Approximate modified policy iteration and its application to the game of Tetris. Journal of Machine Learning Research (JMLR), 16:1629–1676, 2015.
  • Schulman, J., Levine, S., Abbeel, P., Jordan, M., and Moritz, P. Trust region policy optimization. In International Conference on Machine Learning (ICML), 2015.
  • Schulman, J., Chen, X., and Abbeel, P. Equivalence between policy gradients and soft Q-learning. arXiv preprint arXiv:1704.06440, 2017.
  • Schulman, J., Wolski, F., Dhariwal, P., Radford, A., and Klimov, O. Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347, 2017.
  • Shani, L., Efroni, Y., and Mannor, S. Adaptive trust region policy optimization: Global convergence and faster rates for regularized MDPs. In AAAI Conference on Artificial Intelligence (AAAI), 2020.
  • Song, Z., Parr, R., and Carin, L. Revisiting the softmax Bellman operator: New benefits and new perspective. In International Conference on Machine Learning (ICML), 2019.
  • Vieillard, N., Pietquin, O., and Geist, M. Munchausen reinforcement learning. In Advances in Neural Information Processing Systems (NeurIPS), 2020.
  • Vieillard, N., Scherrer, B., Pietquin, O., and Geist, M. Momentum in reinforcement learning. In International Conference on Artificial Intelligence and Statistics (AISTATS), 2020.