# On the role of planning in model-based deep reinforcement learning

Keywords:

reinforcement learning, model-predictive control, model-based reinforcement learning, Monte Carlo tree search, tree search depth (Dtree)

Abstract:

Model-based planning is often thought to be necessary for deep, careful reasoning and generalization in artificial agents. While recent successes of model-based reinforcement learning (MBRL) with deep function approximation have strengthened this hypothesis, the resulting diversity of model-based methods has also made it difficult to tr...

Introduction and Related Work

- Model-based reinforcement learning (MBRL) [9, 26, 47, 49, 74] involves both learning and planning. MBRL methods can be broadly classified into decision-time planning, which uses the model to select actions, and background planning, which uses the model to update a policy [68].
- Decision-time planning methods often feature robustness to uncertainty and fast adaptation to new scenarios [e.g., 76], though they may be insufficient in settings that require long-term reasoning, such as sparse-reward tasks or strategic games like Go.
- Dyna [67] is a classic background planning method which uses the model to simulate data on which to train a policy via standard model-free methods like Q-learning or policy gradient. Background planning methods often feature improved data efficiency over model-free methods [e.g., 34], but exhibit the same drawbacks as model-free approaches, such as brittleness to out-of-distribution experience at test time.
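The Dyna-style background planning described above can be sketched in a few lines. The following tabular Dyna-Q loop is a standard textbook illustration (not the paper's MuZero setup): real transitions drive ordinary Q-learning updates and are also memorized as a model, which is then replayed for extra "simulated" updates.

```python
import random
from collections import defaultdict

def dyna_q(env_step, actions, episodes=50, n_planning=10,
           alpha=0.1, gamma=0.95, epsilon=0.1):
    """Tabular Dyna-Q: model-free Q-learning on real transitions,
    plus background planning on transitions replayed from a
    (here, simply memorized) model."""
    Q = defaultdict(float)   # Q[(state, action)]
    model = {}               # model[(state, action)] = (reward, next_state, done)
    for _ in range(episodes):
        s, done = 0, False
        while not done:
            # epsilon-greedy action selection
            if random.random() < epsilon:
                a = random.choice(actions)
            else:
                a = max(actions, key=lambda act: Q[(s, act)])
            r, s2, done = env_step(s, a)
            # (1) model-free update from the real transition
            best = max(Q[(s2, b)] for b in actions)
            Q[(s, a)] += alpha * (r + gamma * (0 if done else best) - Q[(s, a)])
            # (2) store the transition as a deterministic model
            model[(s, a)] = (r, s2, done)
            # (3) background planning: extra updates on simulated transitions
            for _ in range(n_planning):
                (ps, pa), (pr, ps2, pdone) = random.choice(list(model.items()))
                pbest = max(Q[(ps2, b)] for b in actions)
                Q[(ps, pa)] += alpha * (pr + gamma * (0 if pdone else pbest)
                                        - Q[(ps, pa)])
            s = s2
    return Q
```

On a toy chain MDP, the planning updates propagate reward information back through the state space far faster than real experience alone would.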

Highlights

- Model-based reinforcement learning (MBRL) [9, 26, 47, 49, 74] involves both learning and planning. We systematically study the role of planning and its algorithmic design choices in a recent state-of-the-art MBRL algorithm, MuZero [58]
- Model-predictive control (MPC) [11] is a classic decision-time planning method that uses the model to optimize a sequence of actions starting from the current environment state
- We evaluated MuZero on eight tasks across five domains, selected to include popular MBRL environments with a wide range of characteristics including episode length, reward sparsity, and variation of initial conditions
- Because much work in MBRL focuses on continuous control [e.g., 74], we included three tasks from the DeepMind Control Suite [70]: Acrobot (Sparse Swingup), Cheetah (Run), and Humanoid (Stand)
- A major takeaway from this work is that while search is useful for learning, simple and shallow forms of planning may be sufficient. This has important implications for computational efficiency: the algorithm with DUCT = 1 can be implemented without trees and is far easier to parallelize than Monte-Carlo tree search (MCTS), and the algorithm with tree depth Dtree = 1 can be implemented via model-free techniques [e.g., 1], suggesting that MBRL may not be necessary at all for strong final performance in some domains
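The search MuZero performs at each node is guided by the pUCT rule of Schrittwieser et al. [58], which trades off value estimates against an exploration bonus shaped by the policy prior; the Dtree and DUCT ablations respectively cap how deep the model is unrolled and below which depth this rule is replaced by sampling from the prior. A minimal sketch of the rule (c1 and c2 follow the constants reported for MuZero; the data layout is illustrative):

```python
import math

def puct_score(q, prior, n_parent, n_child, c1=1.25, c2=19652):
    """pUCT score: the exploitation term (q) plus an exploration bonus
    that is large for actions with high prior probability and few visits."""
    bonus = prior * math.sqrt(n_parent) / (1 + n_child)
    bonus *= c1 + math.log((n_parent + c2 + 1) / c2)
    return q + bonus

def select_action(stats):
    """stats maps action -> (q_value, prior, visit_count);
    return the argmax-pUCT action, as done at each node during search."""
    n_parent = sum(n for _, _, n in stats.values())
    return max(stats, key=lambda a: puct_score(stats[a][0], stats[a][1],
                                               n_parent, stats[a][2]))
```

With zero visits the bonus dominates and the prior dictates the choice; as visit counts grow, the bonus shrinks and the empirical value estimates take over.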

Results

- The authors evaluated MuZero on eight tasks across five domains, selected to include popular MBRL environments with a wide range of characteristics including episode length, reward sparsity, and variation of initial conditions.
- The authors discretized the action space of the control tasks as in Tang & Agrawal [69] and Grill et al. [20]
- Three of these environments exhibit some amount of stochasticity and partial observability: the movement of ghosts in Minipacman is stochastic; Go is a two-player game and stochastic from the point of view of each player independently; and using a limited number of observation frames in Atari makes it partially observable.
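The discretization above, in the spirit of Tang & Agrawal [69], splits each continuous action dimension independently into a fixed number of evenly spaced bins; a discrete action is then one bin choice per dimension. A minimal sketch (function names are illustrative, not the paper's code):

```python
def discretize_action_space(low, high, bins_per_dim):
    """Per-dimension uniform discretization of a continuous action box:
    each dimension gets `bins_per_dim` evenly spaced values."""
    per_dim = []
    for lo, hi in zip(low, high):
        step = (hi - lo) / (bins_per_dim - 1)
        per_dim.append([lo + i * step for i in range(bins_per_dim)])
    return per_dim

def to_continuous(per_dim, indices):
    """Map a tuple of bin indices (one per dimension) back to a
    continuous action vector."""
    return [vals[i] for vals, i in zip(per_dim, indices)]
```

Factorizing per dimension keeps the discrete action set linear in the number of dimensions, rather than exponential as a full Cartesian product would be.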

Conclusion

- The authors explored the role of planning in MuZero [58] through a number of ablations and modifications.
- A major takeaway from this work is that while search is useful for learning, simple and shallow forms of planning may be sufficient.
- This has important implications for computational efficiency: the algorithm with DUCT = 1 can be implemented without trees and is far easier to parallelize than MCTS, and the algorithm with Dtree = 1 can be implemented via model-free techniques [e.g., 1], suggesting that MBRL may not be necessary at all for strong final performance in some domains.
- Given that search seems to provide minimal improvements at evaluation in many standard RL environments, it may be computationally prudent to avoid using search altogether at test time.

Summary

## Introduction and Related Work:

Model-based reinforcement learning (MBRL) [9, 26, 47, 49, 74] involves both learning and planning.
- MBRL methods can be broadly classified into decision-time planning, which uses the model to select actions, and background planning, which uses the model to update a policy [68].
- Decision-time planning methods often feature robustness to uncertainty and fast adaptation to new scenarios [e.g., 76], though they may be insufficient in settings that require long-term reasoning, such as sparse-reward tasks or strategic games like Go. Dyna [67] is a classic background planning method which uses the model to simulate data on which to train a policy via standard model-free methods like Q-learning or policy gradient.
- Background planning methods often feature improved data efficiency over model-free methods [e.g., 34], but exhibit the same drawbacks as model-free approaches, such as brittleness to out-of-distribution experience at test time.
## Objectives:

The aim of this paper is to assess the strengths and weaknesses of recent advances in MBRL to help clarify the state of the field.

## Results:

The authors evaluated MuZero on eight tasks across five domains, selected to include popular MBRL environments with a wide range of characteristics including episode length, reward sparsity, and variation of initial conditions.
- The authors discretized the action space of the control tasks as in Tang & Agrawal [69] and Grill et al. [20]
- Three of these environments exhibit some amount of stochasticity and partial observability: the movement of ghosts in Minipacman is stochastic; Go is a two-player game and stochastic from the point of view of each player independently; and using a limited number of observation frames in Atari makes it partially observable.
## Conclusion:

The authors explored the role of planning in MuZero [58] through a number of ablations and modifications.
- A major takeaway from this work is that while search is useful for learning, simple and shallow forms of planning may be sufficient.
- This has important implications for computational efficiency: the algorithm with DUCT = 1 can be implemented without trees and is far easier to parallelize than MCTS, and the algorithm with Dtree = 1 can be implemented via model-free techniques [e.g., 1], suggesting that MBRL may not be necessary at all for strong final performance in some domains.
- Given that search seems to provide minimal improvements at evaluation in many standard RL environments, it may be computationally prudent to avoid using search altogether at test time.

- Table1: Shared hyperparameters
- Table2: Hyperparameters for Minipacman
- Table3: Hyperparameters for Atari
- Table4: Hyperparameters for control suite
- Table5: Hyperparameters for Sokoban
- Table6: Hyperparameters for Go
- Table7: Values obtained by the baseline vanilla MuZero agent (corresponding to the “Learn+Data+Eval” agent in Figure 3), computed from the average of the last 10% of scores seen during training. Shown are the median across ten seeds, as well as the worst and best seeds. Median values are used to normalize the results in Figure 3
- Table8: Values obtained by MuZero at the very start of training (i.e., with a randomly initialized policy). Values are computed from the average of the first 1% of scores seen during training. Shown are the median across ten seeds, as well as the worst and best seeds. Median values are used to normalize the results in Figure 3
- Table9: Values obtained by a version of MuZero that uses no search at evaluation time (corresponding to the “Learn+Data” agent in Figure 3). Shown are the median across ten seeds, as well as the worst and best seeds. Median values are used to normalize the results in Figure 4
- Table10: Values obtained by a baseline vanilla MuZero agent, evaluated offline from a checkpoint saved at the very end of training. For each seed, values are the average over 50 (control tasks and Atari) or 1000 episodes (Minipacman and Sokoban). These values are used to normalize the results in Figure 5 and Figure 6. Note that for Minipacman, the scores reported here are for agents that were both trained and tested on either the in-distribution mazes or the out-of-distribution mazes. Shown are the median across ten seeds, as well as the worst and best seeds
- Table11: Values in Figure 3. Each column shows scores where 0 corresponds to the reward obtained by a randomly initialized agent (Table 8) and 100 corresponds to full MuZero (“Learn+Data+Eval”, Table 7)
- Table12: Effect of the different contributions of search, modeled as Reward ∼ Environment + TrainUpdate * TrainAct + TestAct over N = 360 data points, using the levels for each variable as defined in the table in Figure 3. This ANOVA indicates that the environment, model-based learning, model-based acting during training, and model-based acting during testing are all significant predictors of reward. We did not detect an interaction between model-based learning and model-based acting during learning
- Table13: Effect of tree depth, Dtree, modeled as Reward ∼ Environment * log(Dtree) over N = 375 data points. Where Dtree = ∞, we used the value for the maximum possible depth (i.e. the search budget). Top: this ANOVA indicates that both the environment and tree depth are significant predictors of reward, and that there is an interaction between environment and tree depth. Bottom: individual Spearman rank correlations between reward and log(Dtree) for each environment. p-values are adjusted for multiple comparisons using the Bonferroni correction
- Table14: Effect of exploration vs. exploitation depth, DUCT, modeled as Reward ∼ Environment * log(DUCT) over N = 375 data points. Where DUCT = ∞, we used the value for the maximum possible depth (i.e. the search budget). Top: this ANOVA indicates that neither the environment nor exploration vs. exploitation depth are significant predictors of reward. Bottom: individual Spearman rank correlations between reward and log(DUCT) for each environment. p-values are adjusted for multiple comparisons using the Bonferroni correction. The main effects are primarily driven by Go
- Table15: Effect of the training search budget, B, on the strength of the policy prior, modeled as Reward ∼ Environment * log(B) + log(B)^2 over N = 360 data points. Top: this ANOVA indicates that the environment and budget are significant predictors of reward, and that there is a second-order effect of the search budget, indicating that performance goes down with too many simulations. Additionally, there is an interaction between environment and budget. Bottom: individual Spearman rank correlations between reward and log(B) for each environment. p-values are adjusted for multiple comparisons using the Bonferroni correction. Note that the correlation for Go does not include values for B > 50 (and thus is largely flat, since Go does not learn for small values of B)
- Table16: Effect of the evaluation search budget, B, on generalization reward when using the learned model with MCTS, modeled as Reward ∼ Environment * log(B) over N = 300 data points. Top: this ANOVA indicates that the environment and budget are significant predictors of reward, and that there is an interaction between environment and budget. Bottom: individual Spearman rank correlations between reward and log(B) for each environment. p-values are adjusted for multiple comparisons using the Bonferroni correction
- Table17: Effect of the evaluation search budget, B, on generalization reward when using the simulator with MCTS, modeled as Reward ∼ Environment * log(B) over N = 300 data points. Top: this ANOVA indicates that the environment and budget are significant predictors of reward, and that there is an interaction between environment and budget. Bottom: individual Spearman rank correlations between reward and log(B) for each environment. p-values are adjusted for multiple comparisons using the Bonferroni correction
- Table18: Rank correlations between the search budget, B, and generalization reward in Minipacman for different types of mazes and models. p-values are adjusted for multiple comparisons using the Bonferroni correction
- Table19: Effect of the evaluation search budget (B), the number of unique training mazes (M), and test level on generalization reward in Minipacman when using the simulator with MCTS, modeled as Reward ∼ log(M) * log(B) + Test Level over N = 180 data points. This ANOVA indicates that both the number of training mazes and the search budget are significant predictors of reward, and that there is an interaction between them
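Several of the tables above (Tables 13-18) report Spearman rank correlations between reward and a log-transformed design variable. Spearman's rho is simply the Pearson correlation of the ranks, with ties assigned average ranks; a self-contained sketch (Bonferroni adjustment of p-values aside):

```python
def spearman_rho(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks,
    with tied values assigned their average rank."""
    def ranks(v):
        order = sorted(range(len(v)), key=lambda i: v[i])
        r = [0.0] * len(v)
        i = 0
        while i < len(order):
            j = i
            # extend j over a block of tied values
            while j + 1 < len(order) and v[order[j + 1]] == v[order[i]]:
                j += 1
            avg = (i + j) / 2 + 1  # average 1-based rank for the tied block
            for k in range(i, j + 1):
                r[order[k]] = avg
            i = j + 1
        return r

    rx, ry = ranks(x), ranks(y)
    n = len(x)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = sum((a - mx) ** 2 for a in rx) ** 0.5
    sy = sum((b - my) ** 2 for b in ry) ** 0.5
    return cov / (sx * sy)
```

In practice one would use `scipy.stats.spearmanr`, which also returns the p-value; the version above only shows the statistic itself.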

Key Results

- Search solely to compute policy updates (“Learn”) improves performance to 68.5% (N = 80)
- Allowing the agent to both learn and act via search during training (“Learn+Data”) further improves performance to a median strength of 90.3% (N = 75)
- As before, we find a small but significant improvement in performance of 6.6 percentage points between full MuZero and agents which do not use search at all (t = −5.11, p < 0.001, N = 60)
- The median reward obtained across environments at 625 simulations is also less than the baseline by a median of 3.5 percentage points (t = −4.29, p < 0.001, N = 60), possibly indicating an effect of compounding model errors
- The simulator allows for somewhat better performance, with greater improvements for small numbers of train mazes (t = −7.71, p < 0.001, N = 180, see also Table 18 and 19), indicating the ability of search to help with some amount of distribution shift when using an accurate model
- Reward obtained by the simulator decreases at 3125 simulations compared to 125 simulations (t = −3.56, p = 0.002, N1 = 10, N2 = 10), again indicating a sensitivity to errors in the value function and policy prior
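The percentage strengths quoted throughout (e.g., 68.5%, 90.3%) are normalized scores: as described for Table 11, 0 corresponds to a randomly initialized agent (Table 8) and 100 to full MuZero (Table 7). A one-line sketch of that normalization:

```python
def normalize_score(score, random_score, baseline_score):
    """Map a raw reward onto the 0-100 scale used in the paper's figures:
    0 = randomly initialized agent, 100 = full MuZero baseline."""
    return 100.0 * (score - random_score) / (baseline_score - random_score)
```

Scores above 100 (an ablation beating the full agent) and below 0 (worse than random) are both possible on this scale.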

Study subjects and analysis

Figure 3 shows the results. Across environments, the “One-Step” variant has a median strength of 62.8% (N = 70). Although this variant is not entirely model-free, it does remove much of the dependence on the model, thus establishing a useful minimal-planning baseline to compare against. Using search solely to compute policy updates (“Learn”) improves performance to 68.5% (N = 80). The “Data” variant, where search is only used to select actions, also improves over “One-Step”, to 75.7% (N = 60); this indicates that search can additionally drive performance by enabling the agent to learn from a different state distribution resulting from better actions, echoing other recent work leveraging planning for exploration [45, 60]. Allowing the agent to both learn and act via search during training (“Learn+Data”) further improves performance to a median strength of 90.3% (N = 75). Search at evaluation adds a final increase of 9.8 percentage points.

Reference

- Abbas Abdolmaleki, Jost Tobias Springenberg, Yuval Tassa, Remi Munos, Nicolas Heess, and Martin Riedmiller. Maximum a posteriori policy optimisation. In International Conference on Learning Representations (ICLR), 2018.
- Alekh Agarwal, Nan Jiang, and Sham M Kakade. Reinforcement learning: Theory and algorithms. Technical report, Technical Report, Department of Computer Science, University of Washington, 2019.
- Thomas Anthony, Zheng Tian, and David Barber. Thinking fast and slow with deep learning and tree search. In Advances in Neural Information Processing Systems, pp. 5360–5370, 2017.
- Thomas Anthony, Robert Nishihara, Philipp Moritz, Tim Salimans, and John Schulman. Policy gradient search: Online planning and expert iteration without search trees. arXiv preprint arXiv:1904.03646, 2019.
- Kamyar Azizzadenesheli, Brandon Yang, Weitang Liu, Zachary C Lipton, and Animashree Anandkumar. Surprising negative results for generative adversarial tree search. arXiv preprint arXiv:1806.05780, 2018.
- Victor Bapst, Alvaro Sanchez-Gonzalez, Carl Doersch, Kimberly L Stachenfeld, Pushmeet Kohli, Peter W Battaglia, and Jessica B Hamrick. Structured agents for physical construction. In International conference on machine learning (ICML), 2019.
- Petr Baudis and Jean-loup Gailly. Pachi: State of the art open source Go program. In Advances in computer games, pp. 24–38.
- Marc G Bellemare, Yavar Naddaf, Joel Veness, and Michael Bowling. The arcade learning environment: An evaluation platform for general agents. Journal of Artificial Intelligence Research, 47:253–279, 2013.
- Cameron B. Browne, Edward Powley, Daniel Whitehouse, Simon M. Lucas, Peter I. Cowling, Philipp Rohlfshagen, Stephen Tavener, Diego Perez, Spyridon Samothrakis, and Simon Colton. A survey of Monte Carlo tree search methods. IEEE Transactions on Computational Intelligence and AI in Games, 4(1):1–43, 2012.
- Arunkumar Byravan, Jost Tobias Springenberg, Abbas Abdolmaleki, Roland Hafner, Michael Neunert, Thomas Lampe, Noah Siegel, Nicolas Heess, and Martin Riedmiller. Imagined value gradients: Model-based policy optimization with tranferable latent dynamics models. In Conference on Robot Learning, pp. 566–589, 2020.
- Eduardo F Camacho and Carlos Bordons Alba. Model predictive control. Springer Science & Business Media, 2013.
- Kurtland Chua, Roberto Calandra, Rowan McAllister, and Sergey Levine. Deep reinforcement learning in a handful of trials using probabilistic dynamics models. In Advances in Neural Information Processing Systems, pp. 4754–4765, 2018.
- Remi Coulom. Efficient selectivity and backup operators in Monte-Carlo tree search. In International conference on computers and games, pp. 72–83.
- Peter Dayan, Geoffrey E Hinton, Radford M Neal, and Richard S Zemel. The helmholtz machine. Neural computation, 7(5):889–904, 1995.
- Marc Deisenroth and Carl E Rasmussen. Pilco: A model-based and data-efficient approach to policy search. In International Conference on machine learning (ICML), pp. 465–472, 2011.
- Frederik Ebert, Chelsea Finn, Sudeep Dasari, Annie Xie, Alex Lee, and Sergey Levine. Visual foresight: Model-based deep reinforcement learning for vision-based robotic control. arXiv preprint arXiv:1812.00568, 2018.
- Yonathan Efroni, Gal Dalal, Bruno Scherrer, and Shie Mannor. Multiple-step greedy policies in approximate and online reinforcement learning. In Advances in Neural Information Processing Systems, pp. 5238–5247, 2018.
- Yonathan Efroni, Mohammad Ghavamzadeh, and Shie Mannor. Multi-step greedy and approximate real time dynamic programming. arXiv preprint arXiv:1909.04236, 2019.
- Michael Fairbank. Reinforcement learning by value gradients. arXiv preprint arXiv:0803.3539, 2008.
- Jean-Bastien Grill, Florent Altche, Yunhao Tang, Thomas Hubert, Michal Valko, Ioannis Antonoglou, and Remi Munos. Monte-Carlo tree search as regularized policy optimization. In International conference on machine learning (ICML), 2020.
- Christopher Grimm, Andre Barreto, Satinder Singh, and David Silver. The value equivalence principle for model-based reinforcement learning. Advances in Neural Information Processing Systems, 33, 2020.
- Arthur Guez, Mehdi Mirza, Karol Gregor, Rishabh Kabra, Sebastien Racaniere, Theophane Weber, David Raposo, Adam Santoro, Laurent Orseau, Tom Eccles, Greg Wayne, David Silver, and Timothy Lillicrap. An investigation of model-free planning. In International conference on machine learning (ICML), 2019.
- X. Guo, S. Singh, H. Lee, R. L. Lewis, and X. Wang. Deep Learning for Real-Time Atari Game Play Using Offline Monte-Carlo Tree Search Planning. In Advances in Neural Information Processing Systems, pp. 3338–3346, 2014.
- David Ha and Jurgen Schmidhuber. Recurrent world models facilitate policy evolution. In Advances in Neural Information Processing Systems, 2018.
- Danijar Hafner, Timothy Lillicrap, Jimmy Ba, and Mohammad Norouzi. Dream to control: Learning behaviors by latent imagination. In International Conference on Learning Representations (ICLR), 2020.
- Jessica B Hamrick. Analogues of mental simulation and imagination in deep learning. Current Opinion in Behavioral Sciences, 29:8–16, 2019.
- Jessica B. Hamrick, Victor Bapst, Alvaro Sanchez-Gonzalez, Tobias Pfaff, Theophane Weber, Lars Buesing, and Peter W. Battaglia. Combining Q-learning and search with amortized value estimates. In International Conference on Learning Representations (ICLR), 2020.
- Demis Hassabis, Dharshan Kumaran, Christopher Summerfield, and Matthew Botvinick. Neuroscience-inspired artificial intelligence. Neuron, 95(2):245–258, 2017.
- Nicolas Heess, Gregory Wayne, David Silver, Timothy Lillicrap, Tom Erez, and Yuval Tassa. Learning continuous control policies by stochastic value gradients. In Advances in Neural Information Processing Systems, pp. 2944–2952, 2015.
- Mark K. Ho, David Abel, Thomas L. Griffiths, and Michael L. Littman. The value of abstraction. Current Opinion in Behavioral Sciences, 29:111–116, October 2019.
- Matt Hoffman, Bobak Shahriari, John Aslanides, Gabriel Barth-Maron, Feryal Behbahani, Tamara Norman, Abbas Abdolmaleki, Albin Cassirer, Fan Yang, Kate Baumli, et al. Acme: A research framework for distributed reinforcement learning. arXiv preprint arXiv:2006.00979, 2020.
- G Zacharias Holland, Erin J Talvitie, and Michael Bowling. The effect of planning shape on dyna-style planning in high-dimensional state spaces. arXiv preprint arXiv:1806.01825, 2018.
- Ronald A Howard. Dynamic programming and markov processes. 1960.
- Michael Janner, Justin Fu, Marvin Zhang, and Sergey Levine. When to trust your model: Model-based policy optimization. In Advances in Neural Information Processing Systems, pp. 12519–12530, 2019.
- Nan Jiang, Alex Kulesza, Satinder Singh, and Richard Lewis. The dependence of effective planning horizon on model accuracy. In Proceedings of the 2015 International Conference on Autonomous Agents and Multiagent Systems, pp. 1181–1189, 2015.
- Lukasz Kaiser, Mohammad Babaeizadeh, Piotr Milos, Blazej Osinski, Roy H Campbell, Konrad Czechowski, Dumitru Erhan, Chelsea Finn, Piotr Kozakowski, Sergey Levine, Afroz Mohiuddin, Ryan Sepassi, George Tucker, and Henryk Michalewski. Model based reinforcement learning for atari. In International Conference on Learning Representations (ICLR), 2020.
- Ken Kansky, Tom Silver, David A Mely, Mohamed Eldawy, Miguel Lazaro-Gredilla, Xinghua Lou, Nimrod Dorfman, Szymon Sidor, Scott Phoenix, and Dileep George. Schema networks: Zero-shot transfer with a generative causal model of intuitive physics. In International conference on machine learning (ICML), 2017.
- Levente Kocsis and Csaba Szepesvari. Bandit based Monte-Carlo planning. In European conference on machine learning, pp. 282–293.
- Thanard Kurutach, Ignasi Clavera, Yan Duan, Aviv Tamar, and Pieter Abbeel. Model-ensemble trust-region policy optimization. In International Conference on Learning Representations (ICLR), 2018.
- Michail G Lagoudakis and Ronald Parr. Reinforcement learning as classification: Leveraging modern classifiers. In International Conference on Machine Learning (ICML), pp. 424–431, 2003.
- Brenden M Lake, Tomer D Ullman, Joshua B Tenenbaum, and Samuel J Gershman. Building machines that learn and think like people. Behavioral and brain sciences, 40, 2017.
- Marc Lanctot, Edward Lockhart, Jean-Baptiste Lespiau, Vinicius Zambaldi, Satyaki Upadhyay, Julien Perolat, Sriram Srinivasan, Finbarr Timbers, Karl Tuyls, Shayegan Omidshafiei, Daniel Hennes, Dustin Morrill, Paul Muller, Timo Ewalds, Ryan Faulkner, Janos Kramar, Bart De Vylder, Brennan Saeta, James Bradbury, David Ding, Sebastian Borgeaud, Matthew Lai, Julian Schrittwieser, Thomas Anthony, Edward Hughes, Ivo Danihelka, and Jonah RyanDavis. OpenSpiel: A framework for reinforcement learning in games. CoRR, abs/1908.09453, 2019.
- Sergey Levine and Pieter Abbeel. Learning neural network policies with guided policy search under unknown dynamics. In Advances in Neural Information Processing Systems, pp. 1071– 1079, 2014.
- Timothy P Lillicrap, Jonathan J Hunt, Alexander Pritzel, Nicolas Heess, Tom Erez, Yuval Tassa, David Silver, and Daan Wierstra. Continuous control with deep reinforcement learning. In International Conference on Learning Representations (ICLR), 2016.
- Kendall Lowrey, Aravind Rajeswaran, Sham Kakade, Emanuel Todorov, and Igor Mordatch. Plan online, learn offline: Efficient learning and exploration via model-based control. In International Conference on Learning Representations (ICLR), 2019.
- Yuping Luo, Huazhe Xu, Yuanzhi Li, Yuandong Tian, Trevor Darrell, and Tengyu Ma. Algorithmic framework for model-based deep reinforcement learning with theoretical guarantees. In International Conference on Learning Representations (ICLR), 2019.
- Thomas M Moerland, Joost Broekens, and Catholijn M Jonker. Model-based reinforcement learning: A survey. arXiv preprint arXiv:2006.16712, 2020.
- Igor Mordatch, Kendall Lowrey, Galen Andrew, Zoran Popovic, and Emanuel V Todorov. Interactive control of diverse complex characters with neural networks. In Advances in Neural Information Processing Systems, pp. 3132–3140, 2015.
- Remi Munos. From bandits to Monte-Carlo tree search: The optimistic principle applied to optimization and planning. Foundations and Trends in Machine Learning, 7(1):1–130, 2014.
- Duy Nguyen-Tuong and Jan Peters. Model learning for robot control: a survey. Cognitive processing, 12(4):319–340, 2011.
- Sebastien Racaniere, Theophane Weber, David Reichert, Lars Buesing, Arthur Guez, Danilo Jimenez Rezende, Adria Puigdomenech Badia, Oriol Vinyals, Nicolas Heess, Yujia Li, et al. Imagination-augmented agents for deep reinforcement learning. In Advances in neural information processing systems, pp. 5690–5701, 2017.
- Aravind Rajeswaran, Igor Mordatch, and Vikash Kumar. A game theoretic framework for model based reinforcement learning. In International conference on machine learning (ICML), 2020.
- Christopher D Rosin. Multi-armed bandits with episode context. Annals of Mathematics and Artificial Intelligence, 61(3):203–230, 2011.
- Bruno Scherrer. Approximate policy iteration schemes: a comparison. In International Conference on Machine Learning, pp. 1314–1322, 2014.
- Jurgen Schmidhuber. Making the world differentiable: On using self-supervised fully recurrent neural networks for dynamic reinforcement learning and planning in non-stationary environments. 1990.
- Jurgen Schmidhuber. Curious model-building control systems. In Proc. international joint conference on neural networks, pp. 1458–1463, 1991.
- Jurgen Schmidhuber. On learning to think: Algorithmic information theory for novel combinations of reinforcement learning controllers and recurrent neural world models. arXiv preprint arXiv:1511.09249, 2015.
- Julian Schrittwieser, Ioannis Antonoglou, Thomas Hubert, Karen Simonyan, Laurent Sifre, Simon Schmitt, Arthur Guez, Edward Lockhart, Demis Hassabis, Thore Graepel, et al. Mastering atari, go, chess and shogi by planning with a learned model. arXiv preprint arXiv:1911.08265, 2019.
- John Schulman, Sergey Levine, Pieter Abbeel, Michael Jordan, and Philipp Moritz. Trust region policy optimization. In International conference on machine learning (ICML), pp. 1889– 1897, 2015.
- Ramanan Sekar, Oleh Rybkin, Kostas Daniilidis, Pieter Abbeel, Danijar Hafner, and Deepak Pathak. Planning to explore via self-supervised world models. In International conference on machine learning (ICML), 2020.
- David Silver, Richard S Sutton, and Martin Muller. Sample-based learning and search with permanent and transient memories. In Proceedings of the 25th international conference on Machine learning, pp. 968–975, 2008.
- David Silver, Aja Huang, Chris J Maddison, Arthur Guez, Laurent Sifre, George Van Den Driessche, Julian Schrittwieser, Ioannis Antonoglou, Veda Panneershelvam, Marc Lanctot, et al. Mastering the game of go with deep neural networks and tree search. nature, 529 (7587):484–489, 2016.
- David Silver, Julian Schrittwieser, Karen Simonyan, Ioannis Antonoglou, Aja Huang, Arthur Guez, Thomas Hubert, Lucas Baker, Matthew Lai, Adrian Bolton, et al. Mastering the game of go without human knowledge. nature, 550(7676):354–359, 2017.
- David Silver, Thomas Hubert, Julian Schrittwieser, Ioannis Antonoglou, Matthew Lai, Arthur Guez, Marc Lanctot, Laurent Sifre, Dharshan Kumaran, Thore Graepel, et al. A general reinforcement learning algorithm that masters chess, shogi, and go through self-play. Science, 362 (6419):1140–1144, 2018.
- Jost Tobias Springenberg, Nicolas Heess, Daniel Mankowitz, Josh Merel, Arunkumar Byravan, Abbas Abdolmaleki, Jackie Kay, Jonas Degrave, Julian Schrittwieser, Yuval Tassa, et al. Local search for policy iteration in continuous control. arXiv preprint arXiv:2010.05545, 2020.
- Wen Sun, Geoffrey J Gordon, Byron Boots, and J Bagnell. Dual policy iteration. In Advances in Neural Information Processing Systems, pp. 7059–7069, 2018.
- Richard S Sutton. Dyna, an integrated architecture for learning, planning, and reacting. ACM Sigart Bulletin, 2(4):160–163, 1991.
- Richard S Sutton and Andrew G Barto. Reinforcement learning: An introduction. MIT press, 2018.
- Yunhao Tang and Shipra Agrawal. Discretizing continuous action space for on-policy optimization. In Proceedings of the AAAI Conference on Artificial Intelligence, 2019.
- Yuval Tassa, Yotam Doron, Alistair Muldal, Tom Erez, Yazhe Li, Diego de Las Casas, David Budden, Abbas Abdolmaleki, Josh Merel, Andrew Lefrancq, et al. Deepmind control suite. arXiv preprint arXiv:1801.00690, 2018.
- Gerald Tesauro and Gregory R Galperin. On-line policy improvement using Monte-Carlo search. In Advances in Neural Information Processing Systems, pp. 1068–1074, 1997.
- Manan Tomar, Lior Shani, Yonathan Efroni, and Mohammad Ghavamzadeh. Mirror descent policy optimization. arXiv preprint arXiv:2005.09814, 2020.
- Hado P van Hasselt, Matteo Hessel, and John Aslanides. When to use parametric models in reinforcement learning? In Advances in Neural Information Processing Systems, pp. 14322– 14333, 2019.
- Tingwu Wang, Xuchan Bao, Ignasi Clavera, Jerrick Hoang, Yeming Wen, Eric Langlois, Shunshi Zhang, Guodong Zhang, Pieter Abbeel, and Jimmy Ba. Benchmarking model-based reinforcement learning. arXiv preprint arXiv:1907.02057, 2019.
- Theophane Weber, Nicolas Heess, Lars Buesing, and David Silver. Credit assignment techniques in stochastic computation graphs. arXiv preprint arXiv:1901.01761, 2019.
- Michael C Yip and David B Camarillo. Model-less feedback control of continuum manipulators in constrained environments. IEEE Transactions on Robotics, 30(4):880–889, 2014.
