# Monte-Carlo Tree Search as Regularized Policy Optimization

ICML, pp. 3769-3778, 2020.

Keywords:

Monte-Carlo tree search, deep reinforcement learning, policy gradient, Markov decision process

Abstract:

The combination of Monte-Carlo tree search (MCTS) with deep reinforcement learning has led to significant advances in artificial intelligence. However, AlphaZero, the current state-of-the-art MCTS algorithm, still relies on handcrafted heuristics that are only partially understood. In this paper, we show that AlphaZero's search heuristics, along with other common ones such as UCT, can be interpreted as an approximation to the solution of a specific regularized policy optimization problem.

Introduction

- Policy gradient is at the core of many state-of-the-art deep reinforcement learning (RL) algorithms.
- Among them, trust region policy optimization is a prominent example (Schulman et al., 2015; 2017; Abdolmaleki et al., 2018; Song et al., 2019).
- These algorithmic enhancements have led to significant performance gains in various benchmark domains (Song et al., 2019).

Highlights

- Policy gradient is at the core of many state-of-the-art deep reinforcement learning (RL) algorithms
- In Section 3, we show that AlphaZero computes approximate solutions to a family of regularized policy optimization problems
- In Section 2, we presented AlphaZero, which relies on model-based planning; Section 3 then recasts its Monte-Carlo tree search as regularized policy optimization
- Using the same reasoning as in Section 3.3, we show that this modified UCT formula tracks the solution to a regularized policy optimization problem, generalizing our result to commonly used Monte-Carlo tree search algorithms
- We aim to address several questions: (1) How sensitive are state-of-the-art hybrid algorithms such as AlphaZero to low simulation budgets, and can the ALL variant provide a more robust alternative? (2) Which changes among ACT, SEARCH, and LEARN are most critical to this variant's performance? (3) How does the performance of the ALL variant compare with AlphaZero in environments with large branching factors?
- We showed that the action selection formula used in Monte-Carlo tree search algorithms, most notably AlphaZero, approximates the solution to a regularized policy optimization problem formulated with search Q-values
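
The action selection formula referred to above is AlphaZero's PUCT rule, which trades off a search Q-value against a prior-weighted exploration bonus. The sketch below is illustrative, not the paper's implementation; the names `q`, `prior`, `visit_counts`, and `c_puct` are ours, and the constant 1.25 is just a conventional default.

```python
import numpy as np

def puct_select(q, prior, visit_counts, c_puct=1.25):
    """AlphaZero-style PUCT: argmax_a Q(a) + c * P(a) * sqrt(sum_b N(b)) / (1 + N(a)).

    q            -- search Q-value estimates per action
    prior        -- policy-network prior probabilities per action
    visit_counts -- number of times each action was tried at this node
    """
    total_visits = visit_counts.sum()
    score = q + c_puct * prior * np.sqrt(total_visits) / (1.0 + visit_counts)
    return int(np.argmax(score))

# Toy example: action 2 has the worst Q-value, but a strong prior and few
# visits make its exploration bonus dominate, so it is selected.
q = np.array([0.5, 0.4, 0.1])
prior = np.array([0.1, 0.2, 0.7])
counts = np.array([10.0, 8.0, 1.0])
chosen = puct_select(q, prior, counts)  # → 2
```

As visit counts grow, the bonus term vanishes and selection reduces to greedy maximization of the search Q-values, which is the regime the paper's analysis connects to regularized policy optimization.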

Methods

- (2) Which changes among ACT, SEARCH, and LEARN are most critical to this variant's performance?
- As a variant of AlphaZero, MuZero applies tree search in learned models instead of real environments, which makes it applicable to a wider range of problems.
- Since MuZero shares the same search procedure as AlphaGo, AlphaGo Zero, and AlphaZero, the authors expect the performance gains to be transferable to these algorithms.
- Since the empirical visit distribution π̂ approximates the regularized solution π̄, AlphaZero and other MCTS algorithms can be interpreted as approximate regularized policy optimization
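
The regularized solution π̄ mentioned above maximizes the search Q-values subject to a KL penalty toward the network prior. A minimal numerical sketch of that objective follows; the closed form and bisection mirror the structure the paper derives, but the function and variable names are ours, and we assume a strictly positive prior.

```python
import numpy as np

def regularized_policy(q, prior, lam, iters=80):
    """Solve max_pi <q, pi> - lam * KL(prior || pi) over the probability simplex.

    The maximizer has the form pi(a) = lam * prior(a) / (alpha - q(a)), where
    the scalar alpha is set by bisection so that pi sums to one.  Assumes
    prior(a) > 0 for every action.
    """
    lo = (q + lam * prior).max()   # at this alpha the total mass is >= 1
    hi = q.max() + lam             # at this alpha the total mass is <= 1
    for _ in range(iters):
        alpha = 0.5 * (lo + hi)
        pi = lam * prior / (alpha - q)
        if pi.sum() > 1.0:
            lo = alpha             # too much mass: increase alpha
        else:
            hi = alpha
    return pi / pi.sum()           # tiny renormalization for safety

# Small lam: the solution concentrates on argmax q.
pi_greedy = regularized_policy(np.array([1.0, 0.0]), np.array([0.5, 0.5]), lam=0.01)
# Large lam: the KL term dominates and the solution stays near the prior.
pi_reg = regularized_policy(np.array([1.0, 0.0]), np.array([0.5, 0.5]), lam=100.0)
```

The two limiting cases make the interpretation concrete: with weak regularization the policy acts greedily on the search Q-values, and with strong regularization it defers to the prior, the same interpolation that the visit-count distribution performs as simulations accumulate.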

Conclusion

- The authors showed that the action selection formula used in MCTS algorithms, most notably AlphaZero, approximates the solution to a regularized policy optimization problem formulated with search Q-values.
- The authors' analysis on the behavior of model-based algorithms (i.e., MCTS) has made explicit connections to model-free algorithms.
- The authors hope that this sheds light on new ways of combining both paradigms and opens doors to future ideas and improvements


Reference

- Abdolmaleki, A., Springenberg, J. T., Tassa, Y., Munos, R., Heess, N., and Riedmiller, M. (2018). Maximum a posteriori policy optimisation. arXiv preprint arXiv:1806.06920.
- Andrychowicz, O. M., Baker, B., Chociej, M., Jozefowicz, R., McGrew, B., Pachocki, J., Petron, A., Plappert, M., Powell, G., Ray, A., et al. (2020). Learning dexterous in-hand manipulation. The International Journal of Robotics Research, 39(1):3–20.
- Anthony, T., Nishihara, R., Moritz, P., Salimans, T., and Schulman, J. (2019). Policy gradient search: Online planning and expert iteration without search trees. arXiv preprint arXiv:1904.03646.
- Auer, P. (2002). Using confidence bounds for exploitation-exploration trade-offs. Journal of Machine Learning Research, 3(Nov):397–422.
- Barth-Maron, G., Hoffman, M. W., Budden, D., Dabney, W., Horgan, D., TB, D., Muldal, A., Heess, N., and Lillicrap, T. (2018). Distributional policy gradients. In International Conference on Learning Representations.
- Bellemare, M. G., Naddaf, Y., Veness, J., and Bowling, M. (2013). The arcade learning environment: An evaluation platform for general agents. Journal of Artificial Intelligence Research, 47:253–279.
- Boyd, S. and Vandenberghe, L. (2004). Convex optimization. Cambridge university press.
- Browne, C. B., Powley, E., Whitehouse, D., Lucas, S. M., Cowling, P. I., Rohlfshagen, P., Tavener, S., Perez, D., Samothrakis, S., and Colton, S. (2012). A survey of monte carlo tree search methods. IEEE Transactions on Computational Intelligence and AI in games, 4(1):1–43.
- Bubeck, S., Cesa-Bianchi, N., et al. (2012). Regret analysis of stochastic and nonstochastic multi-armed bandit problems. Foundations and Trends in Machine Learning, 5(1):1–122.
- Csiszár, I. (1964). Eine informationstheoretische Ungleichung und ihre Anwendung auf den Beweis der Ergodizität von Markoffschen Ketten. Magyar Tud. Akad. Mat. Kutató Int. Közl., 8:85–108.
- Dhariwal, P., Hesse, C., Klimov, O., Nichol, A., Plappert, M., Radford, A., Schulman, J., Sidor, S., Wu, Y., and Zhokhov, P. (2017). Openai baselines.
- Dulac-Arnold, G., Evans, R., van Hasselt, H., Sunehag, P., Lillicrap, T., Hunt, J., Mann, T., Weber, T., Degris, T., and Coppin, B. (2015). Deep reinforcement learning in large discrete action spaces. arXiv preprint arXiv:1512.07679.
- Farquhar, G., Rocktäschel, T., Igl, M., and Whiteson, S. (2017). TreeQN and ATreeC: Differentiable tree-structured models for deep reinforcement learning. arXiv preprint arXiv:1710.11417.
- Fox, R., Pakman, A., and Tishby, N. (2015). Taming the noise in reinforcement learning via soft updates. arXiv preprint arXiv:1512.08562.
- Geist, M., Scherrer, B., and Pietquin, O. (2019). A theory of regularized markov decision processes. arXiv preprint arXiv:1901.11275.
- Google (2020). Cloud TPU — Google Cloud. https://cloud.google.com/tpu/.
- Grill, J.-B., Domingues, O. D., Menard, P., Munos, R., and Valko, M. (2019). Planning in entropy-regularized Markov decision processes and games. In Neural Information Processing Systems.
- Guez, A., Weber, T., Antonoglou, I., Simonyan, K., Vinyals, O., Wierstra, D., Munos, R., and Silver, D. (2018). Learning to search with mctsnets. arXiv preprint arXiv:1802.04697.
- Haarnoja, T., Tang, H., Abbeel, P., and Levine, S. (2017). Reinforcement learning with deep energy-based policies. In Proceedings of the 34th International Conference on Machine Learning-Volume 70, pages 1352–1361. JMLR.org.
- Hamrick, J. B., Bapst, V., Sanchez-Gonzalez, A., Pfaff, T., Weber, T., Buesing, L., and Battaglia, P. W. (2020). Combining Q-learning and search with amortized value estimates. In International Conference on Learning Representations.
- He, K., Zhang, X., Ren, S., and Sun, J. (2016). Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 770–778.
- Horgan, D., Quan, J., Budden, D., Barth-Maron, G., Hessel, M., van Hasselt, H., and Silver, D. (2018). Distributed prioritized experience replay. In International Conference on Learning Representations.
- Kocsis, L. and Szepesvári, C. (2006). Bandit based Monte-Carlo planning. In European conference on machine learning, pages 282–293. Springer.
- Levine, S. (2018). Reinforcement learning and control as probabilistic inference: Tutorial and review. arXiv preprint arXiv:1805.00909.
- Liese, F. and Vajda, I. (2006). On divergences and informations in statistics and information theory. IEEE Transactions on Information Theory, 52(10):4394–4412.
- Metz, L., Ibarz, J., Jaitly, N., and Davidson, J. (2017). Discrete sequential prediction of continuous actions for deep rl. arXiv preprint arXiv:1705.05035.
- Neu, G., Jonsson, A., and Gomez, V. (2017). A unified view of entropy-regularized Markov decision processes. arXiv preprint arXiv:1705.07798.
- O’Donoghue, B., Munos, R., Kavukcuoglu, K., and Mnih, V. (2016). Combining policy gradient and Q-learning. arXiv preprint arXiv:1611.01626.
- Oh, J., Singh, S., and Lee, H. (2017). Value prediction network. In Advances in Neural Information Processing Systems, pages 6118–6128.
- Rosin, C. D. (2011). Multi-armed bandits with episode context. Annals of Mathematics and Artificial Intelligence, 61(3):203–230.
- Schrittwieser, J., Antonoglou, I., Hubert, T., Simonyan, K., Sifre, L., Schmitt, S., Guez, A., Lockhart, E., Hassabis, D., Graepel, T., et al. (2019). Mastering Atari, Go, chess and shogi by planning with a learned model. arXiv preprint arXiv:1911.08265.
- Schulman, J., Levine, S., Abbeel, P., Jordan, M., and Moritz, P. (2015). Trust region policy optimization. In International conference on machine learning, pages 1889–1897.
- Schulman, J., Wolski, F., Dhariwal, P., Radford, A., and Klimov, O. (2017). Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347.
- Silver, D., Huang, A., Maddison, C. J., Guez, A., Sifre, L., Van Den Driessche, G., Schrittwieser, J., Antonoglou, I., Panneershelvam, V., Lanctot, M., et al. (2016). Mastering the game of Go with deep neural networks and tree search. Nature, 529(7587):484–489.
- Silver, D., Hubert, T., Schrittwieser, J., Antonoglou, I., Lai, M., Guez, A., Lanctot, M., Sifre, L., Kumaran, D., Graepel, T., et al. (2017a). Mastering chess and shogi by self-play with a general reinforcement learning algorithm. arXiv preprint arXiv:1712.01815.
- Song, H. F., Abdolmaleki, A., Springenberg, J. T., Clark, A., Soyer, H., Rae, J. W., Noury, S., Ahuja, A., Liu, S., Tirumala, D., et al. (2019). V-MPO: On-policy maximum a posteriori policy optimization for discrete and continuous control. arXiv preprint arXiv:1909.12238.
- Sutton, R. S. and Barto, A. G. (1998). Reinforcement Learning: An Introduction. MIT Press.
- Sutton, R. S., McAllester, D. A., Singh, S. P., and Mansour, Y. (2000). Policy gradient methods for reinforcement learning with function approximation. In Advances in neural information processing systems, pages 1057–1063.
- Tang, Y. and Agrawal, S. (2019). Discretizing continuous action space for on-policy optimization. arXiv preprint arXiv:1901.10500.
- Tassa, Y., Doron, Y., Muldal, A., Erez, T., Li, Y., Casas, D. d. L., Budden, D., Abdolmaleki, A., Merel, J., Lefrancq, A., et al. (2018). DeepMind control suite. arXiv preprint arXiv:1801.00690.
- Van de Wiele, T., Warde-Farley, D., Mnih, A., and Mnih, V. (2020). Q-learning in enormous action spaces via amortized approximate maximization. arXiv preprint arXiv:2001.08116.
- Williams, R. J. (1992). Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine learning, 8(3-4):229–256.
- Ziebart, B. D. (2010). Modeling Purposeful Adaptive Behavior with the Principle of Maximum Causal Entropy. PhD thesis, Carnegie Mellon University, USA.
- Silver, D., Schrittwieser, J., Simonyan, K., Antonoglou, I., Huang, A., Guez, A., Hubert, T., Baker, L., Lai, M., Bolton, A., et al. (2017b). Mastering the game of go without human knowledge. Nature, 550(7676):354–359.
- Silver, D., van Hasselt, H., Hessel, M., Schaul, T., Guez, A., Harley, T., Dulac-Arnold, G., Reichert, D., Rabinowitz, N., Barreto, A., et al. (2017c). The predictron: End-to-end learning and planning. In Proceedings of the 34th International Conference on Machine Learning-Volume 70, pages 3191–3199. JMLR.org.