Critic Regularized Regression

NeurIPS 2020.

Keywords:
real world, deep reinforcement learning, behavior cloning, batch deep reinforcement learning, monotonic advantage re-weighted imitation learning

Abstract:

Offline reinforcement learning (RL), also known as batch RL, offers the prospect of policy optimization from large pre-recorded datasets without online environment interaction. It addresses challenges with regard to the cost of data collection and safety, both of which are particularly pertinent to real-world applications of RL. Unfortunately, most off-policy algorithms perform poorly when learning from a fixed dataset. In this paper, we propose a novel offline RL algorithm to learn policies from data using a form of critic-regularized regression (CRR). We find that CRR performs surprisingly well and scales to tasks with high-dimensional state and action spaces, outperforming several state-of-the-art offline RL algorithms by a significant margin on a wide range of benchmark tasks.

Introduction
  • Deep reinforcement learning (RL) algorithms have succeeded in a number of challenging domains.
  • However, one important limitation is that online execution of policies during learning, which the authors refer to as online RL, is often not feasible or desirable because of cost, safety and ethics [8].
  • This is clearly the case in healthcare, industrial control and robotics.
  • This has led to a resurgence of interest in offline RL methods, also known as batch RL [19], which aim to learn policies from logged data without further interaction with the real system.
Highlights
  • Deep reinforcement learning (RL) algorithms have succeeded in a number of challenging domains
  • We propose a novel offline RL algorithm to learn policies from data using a form of critic-regularized regression (CRR)
  • Despite the apparent simplicity of CRR, our experiments show that it outperforms several state-of-the-art offline RL algorithms by a significant margin on a wide range of benchmark tasks
  • In the Appendix, we show that in the tabular setting, CRR is safe in the sense that it restricts the action choice to the support of the data, and can be interpreted as implementing a policy iteration scheme that improves upon the behavior policy
  • We have presented an algorithm for offline RL that is simpler than existing methods but leads to surprisingly good performance even on challenging tasks
  • This paper introduces a new algorithm that could lead to improved performance on some real-world tasks
Methods
  • The authors evaluate CRR on a number of challenging simulated manipulation and locomotion domains.
  • The authors' results demonstrate that CRR works well even in these challenging settings and that it outperforms previously published approaches, in some cases by a considerable margin.
  • The authors perform several ablations that highlight the importance of individual algorithm components, and provide results on some toy domains that provide insight on why alternative approaches may fail.
  • The authors experiment with the continuous control tasks introduced in RL Unplugged (RLU) [3].
  • All simulations are conducted using MuJoCo [36]; illustrations of the environments are given in Fig. 2.
Conclusion
  • The authors have presented an algorithm for offline RL that is simpler than existing methods but leads to surprisingly good performance even on challenging tasks.
  • The CRR exp variant performs especially well across the entire range of tasks considered.
  • Given the already promising performance of CRR, the authors believe that studying its underlying dynamics further is a valuable direction for future work, with the potential to reveal further algorithmic improvements that push the frontier of offline RL algorithms in terms of robustness, performance and simplicity.
Summary
  • Objectives:

    The authors aim to train π by discouraging it from taking actions that are outside the training distribution (see the sketch below).
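To make this concrete, below is a minimal PyTorch sketch of a critic-weighted regression update in the spirit of CRR: actions from the dataset are imitated, but each log-likelihood term is re-weighted by a critic-based advantage estimate so that out-of-distribution or low-value actions receive little or no weight. This is an illustration under stated assumptions, not the authors' implementation; `policy`, `critic`, `m`, `beta`, `clip` and `mode` are placeholder names, and `policy(states)` is assumed to return a torch distribution whose `log_prob` yields one value per batch element.

```python
# Hypothetical sketch of a critic-regularized regression policy update.
# `policy(s)` is assumed to return a torch.distributions.Distribution over
# actions, and `critic(s, a)` a scalar Q-value per (s, a) pair.
import torch

def crr_policy_loss(policy, critic, states, actions,
                    m=4, mode="exp", beta=1.0, clip=20.0):
    """Weighted behavior-cloning loss on a batch of logged (s, a) pairs."""
    dist = policy(states)                              # pi(.|s)
    log_prob = dist.log_prob(actions)                  # log pi(a|s)

    with torch.no_grad():
        q_data = critic(states, actions)               # Q(s, a) for logged actions
        # Baseline from m policy samples: V(s) ~= mean_j Q(s, a_j), a_j ~ pi(.|s).
        sampled = dist.sample((m,))                    # [m, batch, action_dim]
        q_pi = torch.stack([critic(states, sampled[j]) for j in range(m)])
        advantage = q_data - q_pi.mean(dim=0)          # A_mean; q_pi.max(0).values gives A_max

        if mode == "binary":
            weight = (advantage > 0).float()           # indicator weighting
        else:
            weight = torch.clamp(torch.exp(advantage / beta), max=clip)

    # Regression toward logged actions, re-weighted by the critic.
    return -(weight * log_prob).mean()
```

With a constant weight this reduces to plain behavior cloning, which is why the method can be viewed as a filtered or re-weighted regression onto the logged actions.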
Tables
  • Table1: Advantage estimates considered by different algorithms. From these we consider A_mean, but found that for small m it may overestimate the advantage due to stochasticity; for f as in Eq. (3) this could lead to sub-optimal actions being included. We therefore also consider A_max, a pessimistic estimate of the advantage (a short numeric sketch of the two estimates follows this list)
  • Table2: Results on DeepMind Control Suite. We divide the DeepMind Control Suite environments into two rough categories: easy (first 6) and hard (last 3)
  • Table3: Results on Locomotion Suite. The first 3 tasks can be solved by feedforward agents; the corresponding datasets are not sequential. The last 4 tasks necessitate observation histories and all agents here are recurrent
  • Table4: ABM specific hyper-parameters
  • Table5: Table 5
  • Table6: Results on manipulation environments
  • Table7: Results on locomotion environments
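As a companion to the Table 1 caption, the numpy snippet below contrasts the mean-based and max-based advantage estimates computed from m critic evaluations of sampled actions. The function and variable names are hypothetical and the critic values are random stand-ins.

```python
# Illustrative comparison of the two advantage estimates discussed in Table 1.
import numpy as np

def advantage_estimates(q_data, q_samples):
    """q_data: Q(s, a) for the logged action (scalar).
    q_samples: Q(s, a_j) for m actions a_j ~ pi(.|s), shape [m]."""
    a_mean = q_data - np.mean(q_samples)   # noisy for small m, can be optimistic
    a_max = q_data - np.max(q_samples)     # pessimistic: never larger than a_mean
    return a_mean, a_max

# Example with m = 4 sampled actions from the current policy.
rng = np.random.default_rng(0)
q_samples = rng.normal(loc=1.0, scale=0.5, size=4)
print(advantage_estimates(q_data=1.2, q_samples=q_samples))
```

Because max_j Q(s, a_j) is at least mean_j Q(s, a_j), the max-based estimate is never larger than the mean-based one, which is the sense in which it is pessimistic.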
Related work
  • For comprehensive in-depth reviews of offline RL, we refer the reader to Lange et al. [19] and Levine et al. [20]. The latter provides an extensive and very recent appraisal of the field.

    Behavior cloning (BC) [29] is the simplest form of offline learning. Starting from a dataset of state-action pairs, a policy is trained to map states to actions via a supervised loss (a minimal example is sketched below). This approach can be surprisingly effective when the dataset contains high-quality data, e.g. trajectories generated by an expert for the task of interest; see Merel et al. [22] for a large-scale application. However, it can easily fail (i) when the dataset contains a large proportion of random or otherwise task-irrelevant behavior; or (ii) when the learned policy induces a trajectory distribution that deviates far from that of the dataset under consideration [30].
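For contrast with the critic-weighted objective sketched earlier, a plain behavior-cloning step could look like the following. The network sizes, optimizer settings and MSE loss are illustrative assumptions, not details from any of the cited works.

```python
# Minimal behavior-cloning sketch with a deterministic policy network;
# all sizes and hyper-parameters here are illustrative placeholders.
import torch
import torch.nn as nn

policy = nn.Sequential(nn.Linear(17, 256), nn.ReLU(), nn.Linear(256, 6))
optimiser = torch.optim.Adam(policy.parameters(), lr=1e-4)

def bc_step(states, actions):
    """One supervised step: regress predicted actions onto logged actions."""
    loss = nn.functional.mse_loss(policy(states), actions)
    optimiser.zero_grad()
    loss.backward()
    optimiser.step()
    return loss.item()
```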
References
  • Abbas Abdolmaleki, Jost Tobias Springenberg, Yuval Tassa, Remi Munos, Nicolas Heess, and Martin Riedmiller. Maximum a posteriori policy optimisation. In International Conference on Learning Representations, 2018.
  • Rishabh Agarwal, Dale Schuurmans, and Mohammad Norouzi. Striving for simplicity in off-policy deep reinforcement learning. Preprint arXiv:1907.04543, 2019.
  • Anonymous. RL Unplugged: A suite of benchmarks for offline reinforcement learning. In Submitted to NeurIPS 2020, 2020.
  • Gabriel Barth-Maron, Matthew W. Hoffman, David Budden, Will Dabney, Dan Horgan, Dhruva TB, Alistair Muldal, Nicolas Heess, and Timothy Lillicrap. Distributional policy gradients. In International Conference on Learning Representations, 2018.
  • Marc G. Bellemare, Will Dabney, and Rémi Munos. A distributional perspective on reinforcement learning. In International Conference on Machine Learning, pages 449–458, 2017.
  • Serkan Cabi, Sergio Gómez Colmenarejo, Alexander Novikov, Ksenia Konyushkova, Scott Reed, Rae Jeong, Konrad Zolna, Yusuf Aytar, David Budden, Mel Vecerik, Oleg Sushkov, David Barker, Jonathan Scholz, Misha Denil, Nando de Freitas, and Ziyu Wang. Scaling data-driven robotics with reward sketching and batch reinforcement learning. Preprint arXiv:1909.12200, 2019.
  • Xinyue Chen, Zijian Zhou, Zheng Wang, Che Wang, Yanqiu Wu, Qing Deng, and Keith Ross. BAIL: Best-action imitation learning for batch deep reinforcement learning. Preprint arXiv:1910.12179, 2019.
  • Gabriel Dulac-Arnold, Daniel J. Mankowitz, and Todd Hester. Challenges of real-world reinforcement learning. Preprint arXiv:1904.12901, 2019.
  • Scott Fujimoto, Edoardo Conti, Mohammad Ghavamzadeh, and Joelle Pineau. Benchmarking batch deep reinforcement learning algorithms. Preprint arXiv:1910.01708, 2019.
  • Scott Fujimoto, David Meger, and Doina Precup. Off-policy deep reinforcement learning without exploration. In International Conference on Machine Learning, pages 2052–2062, 2019.
  • Yanjun Han, Jiantao Jiao, and Tsachy Weissman. Minimax estimation of discrete distributions. In 2015 IEEE International Symposium on Information Theory (ISIT), pages 2291–2295. IEEE, 2015.
  • Nicolas Heess, Gregory Wayne, David Silver, Tim Lillicrap, Tom Erez, and Yuval Tassa. Learning continuous control policies by stochastic value gradients. In Conference on Neural Information Processing Systems, pages 2944–2952, 2015.
  • Nicolas Heess, Dhruva Tirumala, Srinivasan Sriram, Jay Lemmon, Josh Merel, Greg Wayne, Yuval Tassa, Tom Erez, Ziyu Wang, S. M. Ali Eslami, Martin A. Riedmiller, and David Silver. Emergence of locomotion behaviours in rich environments. Preprint arXiv:1707.02286, 2017.
  • Matt Hoffman, Bobak Shahriari, John Aslanides, Gabriel Barth-Maron, Feryal Behbahani, Tamara Norman, Abbas Abdolmaleki, Albin Cassirer, Fan Yang, Kate Baumli, Sarah Henderson, Alex Novikov, Sergio Gómez Colmenarejo, Serkan Cabi, Caglar Gulcehre, Tom Le Paine, Andrew Cowie, Ziyu Wang, Bilal Piot, and Nando de Freitas. Acme: A research framework for distributed reinforcement learning. Preprint arXiv:2006.00979, 2020.
  • Natasha Jaques, Asma Ghandeharioun, Judy Hanwen Shen, Craig Ferguson, Àgata Lapedriza, Noah Jones, Shixiang Gu, and Rosalind W. Picard. Way off-policy batch deep reinforcement learning of implicit human preferences in dialog. Preprint arXiv:1907.00456, 2019.
  • Diederik P. Kingma and Jimmy Ba. Adam: A method for stochastic optimization. Preprint arXiv:1412.6980, 2014.
  • Aviral Kumar, Justin Fu, Matthew Soh, George Tucker, and Sergey Levine. Stabilizing offpolicy Q-learning via bootstrapping error reduction. In Conference on Neural Information Processing Systems, pages 11761–11771, 2019.
  • Aviral Kumar, Xue Bin Peng, and Sergey Levine. Reward-conditioned policies. Preprint arXiv:1912.13465, 2019.
  • Sascha Lange, Thomas Gabel, and Martin Riedmiller. Batch reinforcement learning. In Marco Wiering and Martijn van Otterlo, editors, Reinforcement Learning: State-of-the-Art, pages 45–73. Springer Berlin Heidelberg, 2012.
  • Sergey Levine, Aviral Kumar, George Tucker, and Justin Fu. Offline reinforcement learning: Tutorial, review, and perspectives on open problems. Preprint arXiv:2005.01643, 2020.
  • Josh Merel, Arun Ahuja, Vu Pham, Saran Tunyasuvunakool, Siqi Liu, Dhruva Tirumala, Nicolas Heess, and Greg Wayne. Hierarchical visuomotor control of humanoids. In International Conference on Learning Representations, 2019.
  • Josh Merel, Leonard Hasenclever, Alexandre Galashov, Arun Ahuja, Vu Pham, Greg Wayne, Yee Whye Teh, and Nicolas Heess. Neural probabilistic motor primitives for humanoid control. In International Conference on Learning Representations, 2019.
  • Josh Merel, Diego Aldarondo, Jesse Marshall, Yuval Tassa, Greg Wayne, and Bence Ölveczky. Deep neuroethology of a virtual rodent. In International Conference on Learning Representations, 2020.
  • Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Andrei A. Rusu, Joel Veness, Marc G. Bellemare, Alex Graves, Martin Riedmiller, Andreas K. Fidjeland, Georg Ostrovski, Stig Petersen, Charles Beattie, Amir Sadik, Ioannis Antonoglou, Helen King, Dharshan Kumaran, Daan Wierstra, Shane Legg, and Demis Hassabis. Human-level control through deep reinforcement learning. Nature, 518(7540):529–533, 2015.
  • Ashvin Nair, Murtaza Dalal, Abhishek Gupta, and Sergey Levine. Accelerating online reinforcement learning with offline datasets. Preprint arXiv:2006.09359, 2020.
  • Ashvin Nair, Bob McGrew, Marcin Andrychowicz, Wojciech Zaremba, and Pieter Abbeel. Overcoming exploration in reinforcement learning with demonstrations. Preprint arXiv:1709.10089, 2017.
  • Gerhard Neumann and Jan R. Peters. Fitted Q-iteration by advantage weighted regression. In Conference on Neural Information Processing Systems. 2009.
  • Xue Bin Peng, Aviral Kumar, Grace Zhang, and Sergey Levine. Advantage-weighted regression: Simple and scalable off-policy reinforcement learning. Preprint arXiv:1910.00177, 2019.
  • Dean A Pomerleau. ALVINN: An autonomous land vehicle in a neural network. In Conference on Neural Information Processing Systems, pages 305–313, 1989.
  • Stephane Ross, Geoffrey Gordon, and Drew Bagnell. A reduction of imitation learning and structured prediction to no-regret online learning. In International Conference on Artificial Intelligence and Statistics, 2011.
  • Noah Siegel, Jost Tobias Springenberg, Felix Berkenkamp, Abbas Abdolmaleki, Michael Neunert, Thomas Lampe, Roland Hafner, Nicolas Heess, and Martin Riedmiller. Keep doing what worked: Behavior modelling priors for offline reinforcement learning. In International Conference on Learning Representations, 2020.
  • David Silver, Guy Lever, Nicolas Heess, Thomas Degris, Daan Wierstra, and Martin A. Riedmiller. Deterministic policy gradient algorithms. In International Conference on Machine Learning, pages 387–395, 2014.
  • David Silver, Thomas Hubert, Julian Schrittwieser, Ioannis Antonoglou, Matthew Lai, Arthur Guez, Marc Lanctot, Laurent Sifre, Dharshan Kumaran, Thore Graepel, Timothy Lillicrap, Karen Simonyan, and Demis Hassabis. A general reinforcement learning algorithm that masters chess, shogi, and Go through self-play. Science, 362(6419):1140–1144, 2018.
  • Richard S. Sutton, David McAllester, Satinder Singh, and Yishay Mansour. Policy gradient methods for reinforcement learning with function approximation. In Conference on Neural Information Processing Systems, page 1057–1063, 1999.
  • Yuval Tassa, Yotam Doron, Alistair Muldal, Tom Erez, Yazhe Li, Diego de Las Casas, David Budden, Abbas Abdolmaleki, Josh Merel, Andrew Lefrancq, Timothy Lillicrap, and Martin Riedmiller. DeepMind control suite. Preprint arXiv:1801.00690, 2018.
  • Emanuel Todorov, Tom Erez, and Yuval Tassa. Mujoco: A physics engine for model-based control. In International Conference on Intelligent Robots and Systems, pages 5026–5033, 2012.
  • Hado van Hasselt and Marco Wiering. Reinforcement learning in continuous action spaces. In IEEE International Symposium on Approximate Dynamic Programming and Reinforcement Learning (ADPRL), 2007.
  • Qing Wang, Jiechao Xiong, Lei Han, Peng Sun, Han Liu, and Tong Zhang. Exponentially weighted imitation learning for batched historical data. In Conference on Neural Information Processing Systems, pages 6288–6297, 2018.