# MOReL: Model-Based Offline Reinforcement Learning

NeurIPS 2020.

Keywords:

Eps-3, Gauss-1, Gauss-3, continuous control, offline reinforcement learning, maximum mean discrepancy, reinforcement learning

Abstract:

In offline reinforcement learning (RL), the goal is to learn a successful policy using only a dataset of historical interactions with the environment, without any additional online interactions. This serves as an extreme test for an agent's ability to effectively use historical data, which is critical for efficient RL. Prior work in off...

Introduction

- The availability and use of large datasets have enabled tremendous advances in computer vision [1], speech recognition [2], and natural language processing [3, 4].
- In these fields, it is customary to first collect large datasets [5, 6, 7], train deep learning models on these datasets, and deploy these models on various platforms.
- Similar to progress in other fields of AI, the ability to learn effectively from large offline datasets may hold the key to unlocking the sample efficiency of RL agents.

Highlights

- The availability and use of large datasets have enabled tremendous advances in computer vision [1], speech recognition [2], and natural language processing [3, 4].
- Similar to progress in other fields of AI, the ability to learn effectively from large offline datasets may hold the key to unlocking the sample efficiency of reinforcement learning (RL) agents.
- We evaluate our algorithm on the standard continuous control benchmarks in OpenAI gym, modified for the batch setting as in a number of recent works [16, 17, 20], and find that it obtains state-of-the-art (SOTA) results in a majority of the tasks.
- Our results suggest that MOReL with the pessimistic Markov decision process (MDP) construction significantly outperforms naive model-based RL (MBRL).
- We introduce a new model-based framework, MOReL, for the offline RL problem.
- MOReL incorporates both generalization and pessimism. This lets it perform policy search in known states, which may not occur directly in the static offline dataset but can be predicted from it, while not drifting into unknown states that cannot be predicted from the static offline data.
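
The interplay of known and unknown states described in the last highlight can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: it assumes unknown state-action pairs are flagged by disagreement within an ensemble of learned dynamics models, and that entering an unknown region moves the agent to an absorbing state with a fixed negative reward. The names `HALT`, `disagreement`, `pessimistic_step`, `threshold`, and `penalty` are hypothetical.

```python
import math
from typing import Callable, Sequence, Tuple

State = Tuple[float, ...]
Action = Tuple[float, ...]
Model = Callable[[State, Action], State]

# Hypothetical absorbing state that stands for "unknown territory".
HALT: State = ()


def disagreement(models: Sequence[Model], s: State, a: Action) -> float:
    """Largest pairwise Euclidean distance between ensemble predictions."""
    preds = [m(s, a) for m in models]
    return max(
        (math.dist(p, q) for i, p in enumerate(preds) for q in preds[i + 1:]),
        default=0.0,
    )


def pessimistic_step(
    models: Sequence[Model],
    reward_fn: Callable[[State, Action], float],
    s: State,
    a: Action,
    threshold: float,
    penalty: float,
) -> Tuple[State, float]:
    """One transition of the pessimistic model (illustrative assumption).

    Known pairs follow the learned dynamics and reward; unknown pairs,
    and anything after HALT, yield HALT with reward -penalty.
    """
    if s == HALT or disagreement(models, s, a) > threshold:
        return HALT, -penalty
    return models[0](s, a), reward_fn(s, a)
```

A planner run against `pessimistic_step` instead of the raw learned model has no incentive to exploit regions where the models disagree, because any gain there is replaced by the penalty.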

Methods

- Environments and partially trained policies: Following recent works in offline RL [16, 17, 20], the authors consider four continuous control tasks from OpenAI gym [77], simulated with MuJoCo [78]: Hopper-v2, HalfCheetah-v2, Ant-v2, and Walker2d-v2.
- In the offline setting, one typically has access to data collected by a partially trained, sub-optimal policy interacting with the environment.
- To simulate this setting, following guidelines from prior work [16, 17, 20], the authors obtain a partially trained policy π_p by running TRPO [65] in these environments until the policy reaches a value of 1000, 4000, 1000, and 1000, respectively, for the four environments.
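
The keywords and Table 1 refer to exploration configurations named Eps and Gauss, but this summary does not spell out how they perturb π_p. As a hedged sketch only, the snippet below shows two common ways to add exploration to a fixed policy when logging data: replacing the action with a uniform random one with some probability, and adding clipped Gaussian noise. The function names, the action bounds, and every noise level are assumptions for illustration, not values from the paper.

```python
import random
from typing import Callable, List, Sequence

Policy = Callable[[Sequence[float]], List[float]]


def eps_random(policy: Policy, eps: float, low: float, high: float,
               rng: random.Random) -> Policy:
    """With probability eps, swap the policy's action for a uniform random one."""
    def act(obs: Sequence[float]) -> List[float]:
        action = policy(obs)
        if rng.random() < eps:
            return [rng.uniform(low, high) for _ in action]
        return action
    return act


def gaussian_noise(policy: Policy, std: float, low: float, high: float,
                   rng: random.Random) -> Policy:
    """Add zero-mean Gaussian noise to each action dimension, then clip."""
    def act(obs: Sequence[float]) -> List[float]:
        return [min(high, max(low, a + rng.gauss(0.0, std))) for a in policy(obs)]
    return act


# Stand-in for a partially trained policy (hypothetical, ignores the observation).
def partial_policy(obs: Sequence[float]) -> List[float]:
    return [0.5, -0.5]


rng = random.Random(0)
noisy = gaussian_noise(partial_policy, std=0.1, low=-1.0, high=1.0, rng=rng)
logged_actions = [noisy([0.0]) for _ in range(3)]
```

Wrapping the policy rather than the environment keeps the data-collection loop unchanged, so the same logging code can be reused across configurations.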

Results

- The authors' results suggest that MOReL with the pessimistic MDP construction significantly outperforms naive MBRL.

Conclusion

- The authors introduced a new model-based framework, MOReL, for the offline RL problem.
- The modular structure of MOReL, comprising model learning, uncertainty estimation, and planning, allows the use of a variety of approaches in each of these modules.
- While the instantiation of MOReL in this paper uses simple and standard approaches, an interesting direction for future work is to explore the benefits of more sophisticated approaches, such as multi-step prediction for model learning and prediction with abstention for uncertainty estimation.
- MOReL's modular structure allows it to automatically benefit from future progress in any of the modules.
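
The modularity claimed above can be pictured as three pluggable functions behind one small pipeline. The sketch below is an assumption-laden toy, not the paper's code: states and actions are scalars, and `OfflinePipeline`, `fit_linear`, `nearest_neighbor_distance`, and `greedy_planner` are hypothetical stand-ins chosen only so the example runs end to end.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

Transition = Tuple[float, float, float]  # (state, action, next_state)
Dynamics = Callable[[float, float], float]
Uncertainty = Callable[[float, float], float]
Policy = Callable[[float], float]


@dataclass
class OfflinePipeline:
    """Three swappable modules: model learning, uncertainty, planning."""
    learn_model: Callable[[List[Transition]], Dynamics]
    estimate_uncertainty: Callable[[List[Transition]], Uncertainty]
    plan: Callable[[Dynamics, Uncertainty], Policy]

    def run(self, data: List[Transition]) -> Policy:
        model = self.learn_model(data)
        uncertainty = self.estimate_uncertainty(data)
        return self.plan(model, uncertainty)


def fit_linear(data: List[Transition]) -> Dynamics:
    """Least-squares fit of next_state = state + c * action."""
    c = sum(a * (s2 - s) for s, a, s2 in data) / sum(a * a for _, a, _ in data)
    return lambda s, a: s + c * a


def nearest_neighbor_distance(data: List[Transition]) -> Uncertainty:
    """Uncertainty = L1 distance from (s, a) to the closest logged pair."""
    return lambda s, a: min(abs(s - ds) + abs(a - da) for ds, da, _ in data)


def greedy_planner(model: Dynamics, uncertainty: Uncertainty,
                   actions=(-1.0, 0.0, 1.0), max_u: float = 0.5) -> Policy:
    """Among trusted actions, pick the one with the largest predicted next state."""
    def policy(s: float) -> float:
        trusted = [a for a in actions if uncertainty(s, a) <= max_u] or [0.0]
        return max(trusted, key=lambda a: model(s, a))
    return policy


data = [(0.0, 1.0, 2.0), (0.0, -1.0, -2.0), (2.0, 1.0, 4.0)]
policy = OfflinePipeline(fit_linear, nearest_neighbor_distance, greedy_planner).run(data)
```

Swapping `fit_linear` for a neural network ensemble, or `nearest_neighbor_distance` for a different uncertainty estimate, would leave `OfflinePipeline.run` untouched, which is the kind of benefit the conclusion attributes to the modular design.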

Summary

## Objectives:

The authors aim to design algorithms that achieve as low a sub-optimality as possible.

- Table 1: Results in the four environments and five exploration configurations. 0 represents overflow/divergence for Q-learning based algorithms.
- Table 2: Value of the policy output by MOReL when working with a dataset collected with a random policy (Pure-random) and a partially trained policy (Pure-partial). The value of the behavior policy is indicated in parentheses. All results are averaged over 5 random seeds.

Related Work

- Our work takes a model-based approach to offline RL. We review related work pertaining to both of these domains in this section.

2.1 The offline RL setting

Offline RL, as a problem setting, dates back at least to the work of Lange et al. [11]. In this setting, an RL agent is given access to a typically large offline dataset, from which it must produce a highly rewarding policy. This has direct applications in fields like healthcare [34, 35, 36], recommendation systems [37, 38, 39, 40], dialogue systems [41, 19, 42], and autonomous driving [43]. We refer readers to the review paper of Levine et al. [44] for an overview of potential applications. On the algorithmic front, prior work in offline RL can be broadly categorized into three groups.

Funding

- Rahul Kidambi acknowledges funding from NSF Award CCF-1740822.
- Thorsten Joachims acknowledges funding from NSF Award IIS-1901168.

References

- Alex Krizhevsky, Ilya Sutskever, and Geoffrey E. Hinton. Imagenet classification with deep convolutional neural networks. Commun. ACM, 60(6), 2017.
- Geoffrey Hinton, Li Deng, Dong Yu, George Dahl, Abdel-rahman Mohamed, Navdeep Jaitly, Vincent Vanhoucke, Patrick Nguyen, Tara Sainath, and Brian Kingsbury. Deep neural networks for acoustic modeling in speech recognition, November 26 2012.
- Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg Corrado, and Jeffrey Dean. Distributed representations of words and phrases and their compositionality. CoRR, abs/1310.4546, 2013.
- Matthew E. Peters, Mark Neumann, Mohit Iyyer, Matt Gardner, Christopher Clark, Kenton Lee, and Luke Zettlemoyer. Deep contextualized word representations. CoRR, abs/1802.05365, 2018.
- J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei. Imagenet: A large-scale hierarchical image database. In CVPR, pages 248–255, 2009.
- W. Fisher, G. Doddington, and K. Goudie-Marshall. The DARPA speech recognition research database: Specification and status. In Proceedings of the DARPA Workshop, pages 93–100, 1986.
- Ciprian Chelba, Tomas Mikolov, Mike Schuster, Qi Ge, Thorsten Brants, and Phillipp Koehn. One billion word benchmark for measuring progress in statistical language modeling. CoRR, abs/1312.3005, 2013.
- R. S. Sutton and A. G. Barto. Reinforcement Learning: An Introduction. MIT Press, Cambridge, MA, 1998.
- Yan Duan, Xi Chen, Rein Houthooft, John Schulman, and Pieter Abbeel. Benchmarking deep reinforcement learning for continuous control. ArXiv, abs/1604.06778, 2016.
- OpenAI, Marcin Andrychowicz, Bowen Baker, Maciek Chociej, Rafał Józefowicz, Bob McGrew, Jakub Pachocki, Arthur Petron, Matthias Plappert, Glenn Powell, Alex Ray, Jonas Schneider, Szymon Sidor, Josh Tobin, Peter Welinder, Lilian Weng, and Wojciech Zaremba. Learning dexterous in-hand manipulation. CoRR, abs/1808.00177, 2018.
- Sascha Lange, Thomas Gabel, and Martin A. Riedmiller. Batch reinforcement learning. In Reinforcement Learning, volume 12.
- Philip S Thomas. Safe reinforcement learning. PhD Thesis, 2014.
- Philip S. Thomas, Bruno Castro da Silva, Andrew G. Barto, Stephen Giguere, Yuriy Brun, and Emma Brunskill. Preventing undesirable behavior of intelligent machines. Science, 366(6468):999–1004, 2019.
- Scott Fujimoto, Herke van Hoof, and Dave Meger. Addressing function approximation error in actor-critic methods. CoRR, abs/1802.09477, 2018.
- Hado van Hasselt, Yotam Doron, Florian Strub, Matteo Hessel, Nicolas Sonnerat, and Joseph Modayil. Deep reinforcement learning and the deadly triad. CoRR, abs/1812.02648, 2018.
- Scott Fujimoto, David Meger, and Doina Precup. Off-policy deep reinforcement learning without exploration. CoRR, abs/1812.02900, 2018.
- Aviral Kumar, Justin Fu, George Tucker, and Sergey Levine. Stabilizing off-policy q-learning via bootstrapping error reduction. CoRR, abs/1906.00949, 2019.
- Rishabh Agarwal, Dale Schuurmans, and Mohammad Norouzi. Striving for simplicity in off-policy deep reinforcement learning. CoRR, abs/1907.04543, 2019.
- Natasha Jaques, Asma Ghandeharioun, Judy Hanwen Shen, Craig Ferguson, Àgata Lapedriza, Noah Jones, Shixiang Gu, and Rosalind W. Picard. Way off-policy batch deep reinforcement learning of implicit human preferences in dialog. CoRR, abs/1907.00456, 2019.
- Yifan Wu, George Tucker, and Ofir Nachum. Behavior regularized offline reinforcement learning. CoRR, arXiv:1911.11361, 2019.
- Romain Laroche and Paul Trichelair. Safe policy improvement with baseline bootstrapping. CoRR, abs/1712.06924, 2017.
- Ofir Nachum, Bo Dai, Ilya Kostrikov, Yinlam Chow, Lihong Li, and Dale Schuurmans. Algaedice: Policy gradient from arbitrary experience. CoRR, arXiv:1912.02074, 2019.
- Emanuel Todorov and Weiwei Li. A generalized iterative lqg method for locally-optimal feedback control of constrained nonlinear stochastic systems. In ACC, 2005.
- Yuval Tassa, Tom Erez, and Emanuel Todorov. Synthesis and stabilization of complex behaviors through online trajectory optimization. In Intelligent Robots and Systems (IROS), 2012 IEEE/RSJ International Conference on, pages 4906–4913. IEEE, 2012.
- Cameron Browne, Edward Jack Powley, Daniel Whitehouse, Simon M. Lucas, Peter I. Cowling, Philipp Rohlfshagen, Stephen Tavener, Diego Perez Liebana, Spyridon Samothrakis, and Simon Colton. A survey of monte carlo tree search methods. IEEE Transactions on Computational Intelligence and AI in Games, 4:1–43, 2012.
- Rémi Munos and Csaba Szepesvari. Finite-time bounds for fitted value iteration. J. Mach. Learn. Res., 9:815–857, 2008.
- John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. Proximal policy optimization algorithms. CoRR, abs/1707.06347, 2017.
- Aravind Rajeswaran, Igor Mordatch, and Vikash Kumar. A game theoretic framework for model based reinforcement learning. ArXiv, abs/2004.07804, 2020.
- Michael Janner, Justin Fu, Marvin Zhang, and Sergey Levine. When to trust your model: Model-based policy optimization. CoRR, abs/1906.08253, 2019.
- Kurtland Chua, Roberto Calandra, Rowan McAllister, and Sergey Levine. Deep reinforcement learning in a handful of trials using probabilistic dynamics models. In NeurIPS, 2018.
- Anusha Nagabandi, Gregory Kahn, Ronald S. Fearing, and Sergey Levine. Neural network dynamics for model-based deep reinforcement learning with model-free fine-tuning. In ICRA, 2018.
- Sham M. Kakade. A natural policy gradient. In NIPS, pages 1531–1538, 2001.
- Aravind Rajeswaran, Kendall Lowrey, Emanuel Todorov, and Sham Kakade. Towards Generalization and Simplicity in Continuous Control. In NIPS, 2017.
- Omer Gottesman, Fredrik D. Johansson, Joshua Meier, Jack Dent, Donghun Lee, Srivatsan Srinivasan, Linying Zhang, Yi Ding, David Wihl, Xuefeng Peng, Jiayu Yao, Isaac Lage, Christopher Mosch, Li-Wei H. Lehman, Matthieu Komorowski, Aldo Faisal, Leo Anthony Celi, David A. Sontag, and Finale Doshi-Velez. Evaluating reinforcement learning algorithms in observational health settings. CoRR, abs/1805.12298, 2018.
- Lu Wang, Wei Zhang, Xiaofeng He, and Hongyuan Zha. Supervised reinforcement learning with recurrent neural network for dynamic treatment recommendation. In Yike Guo and Faisal Farooq, editors, KDD, pages 2447–2456. ACM, 2018.
- Chao Yu, Guoqi Ren, and Jiming Liu. Deep inverse reinforcement learning for sepsis treatment. In ICHI, pages 1–3. IEEE, 2019.
- Alexander L. Strehl, John Langford, and Sham M. Kakade. Learning from logged implicit exploration data. CoRR, abs/1003.0120, 2010.
- Adith Swaminathan and Thorsten Joachims. Batch learning from logged bandit feedback through counterfactual risk minimization. J. Mach. Learn. Res, 16:1731–1755, 2015.
- Paul Covington, Jay Adams, and Emre Sargin. Deep neural networks for youtube recommendations. In RecSys. ACM, 2016.
- Minmin Chen, Alex Beutel, Paul Covington, Sagar Jain, Francois Belletti, and Ed H. Chi. Top-k off-policy correction for a reinforce recommender system. CoRR, abs/1812.02353, 2018.
- Li Zhou, Kevin Small, Oleg Rokhlenko, and Charles Elkan. End-to-end offline goal-oriented dialog policy learning via policy gradient. CoRR, abs/1712.02838, 2017.
- Nikos Karampatziakis, Sebastian Kochman, Jade Huang, Paul Mineiro, Kathy Osborne, and Weizhu Chen. Lessons from real-world reinforcement learning in a customer support bot. CoRR, abs/1905.02219, 2019.
- Ahmad El Sallab, Mohammed Abdou, Etienne Perot, and Senthil Kumar Yogamani. Deep reinforcement learning framework for autonomous driving. CoRR, abs/1704.02532, 2017.
- Sergey Levine, Aviral Kumar, George Tucker, and Justin Fu. Offline reinforcement learning: Tutorial, review, and perspectives on open problems. CoRR, abs/2005.01643, 2020.
- Lihong Li, Wei Chu, John Langford, and Xuanhui Wang. Unbiased offline evaluation of contextual-bandit-based news article recommendation algorithms, 2010.
- Yao Liu, Adith Swaminathan, Alekh Agarwal, and Emma Brunskill. Off-policy policy gradient with state distribution correction. CoRR, abs/1904.08473, 2019.
- Assaf Hallak and Shie Mannor. Consistent on-line off-policy evaluation. CoRR, abs/1702.07121, 2017.
- Carles Gelada and Marc G. Bellemare. Off-policy deep reinforcement learning by bootstrapping the covariate shift. In AAAI, pages 3647–3655. AAAI Press, 2019.
- Ofir Nachum, Yinlam Chow, Bo Dai, and Lihong Li. Dualdice: Behavior-agnostic estimation of discounted stationary distribution corrections. CoRR, abs/1906.04733, 2019.
- Chris Watkins. Learning from delayed rewards. PhD Thesis, Cambridge University, 1989.
- Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Andrei A. Rusu, Joel Veness, Marc G. Bellemare, Alex Graves, Martin A. Riedmiller, Andreas Fidjeland, Georg Ostrovski, Stig Petersen, Charles Beattie, Amir Sadik, Ioannis Antonoglou, Helen King, Dharshan Kumaran, Daan Wierstra, Shane Legg, and Demis Hassabis. Human-level control through deep reinforcement learning. Nature, 518(7540):529–533, 2015.
- Timothy P. Lillicrap, Jonathan J. Hunt, Alexander Pritzel, Nicolas Heess, Tom Erez, Yuval Tassa, David Silver, and Daan Wierstra. Continuous control with deep reinforcement learning. In ICLR, 2016.
- Tuomas Haarnoja, Aurick Zhou, Pieter Abbeel, and Sergey Levine. Soft actor-critic: Off-policy maximum entropy deep reinforcement learning with a stochastic actor. CoRR, abs/1801.01290, 2018.
- Ofir Nachum and Bo Dai. Reinforcement learning via fenchel-rockafellar duality. CoRR, arXiv:2001.01866, 2020.
- Stephane Ross and Drew Bagnell. Agnostic system identification for model-based reinforcement learning. In ICML, 2012.
- Michael Kearns and Satinder Singh. Near optimal reinforcement learning in polynomial time. Machine Learning, 49(2-3):209–232, 2002.
- Ronen I. Brafman and Moshe Tennenholtz. R-max - a general polynomial time algorithm for near-optimal reinforcement learning. J. Mach. Learn. Res., 3:213–231, 2001.
- Alekh Agarwal, Sham M. Kakade, and Lin F. Yang. On the optimality of sparse model-based planning for markov decision processes. CoRR, abs/1906.03804, 2019.
- Sham M. Kakade, Michael J. Kearns, and John Langford. Exploration in metric state spaces. In ICML, 2003.
- Andy Zeng, Shuran Song, Johnny Lee, Alberto Rodríguez, and Thomas A. Funkhouser. Tossingbot: Learning to throw arbitrary objects with residual physics. ArXiv, abs/1903.11239, 2019.
- Grady Williams, Nolan Wagener, Brian Goldfain, Paul Drews, James M. Rehg, Byron Boots, and Evangelos Theodorou. Information theoretic mpc for model-based reinforcement learning. 2017 IEEE International Conference on Robotics and Automation (ICRA), pages 1714–1721, 2017.
- Kendall Lowrey, Aravind Rajeswaran, Sham Kakade, Emanuel Todorov, and Igor Mordatch. Plan Online, Learn Offline: Efficient Learning and Exploration via Model-Based Control. In International Conference on Learning Representations (ICLR), 2019.
- Steven M. Lavalle. Rapidly-exploring random trees: A new tool for path planning, 1998.
- Dimitri P. Bertsekas and John N. Tsitsiklis. Neuro-Dynamic Programming. Athena Scientific, Belmont, MA, 1996.
- John Schulman, Sergey Levine, Philipp Moritz, Michael I. Jordan, and Pieter Abbeel. Trust region policy optimization. CoRR, abs/1502.05477, 2015.
- Tingwu Wang and Jimmy Ba. Exploring model-based planning with policy networks. ArXiv, abs/1906.08649, 2020.
- Tingwu Wang, Xuchan Bao, Ignasi Clavera, Jerrick Hoang, Yeming Wen, Eric Langlois, S. Zhang, Guodong Zhang, Pieter Abbeel, and Jimmy Ba. Benchmarking model-based reinforcement learning. ArXiv, abs/1907.02057, 2019.
- Anusha Nagabandi, Kurt Konolige, Sergey Levine, and Vikash Kumar. Deep dynamics models for learning dexterous manipulation. ArXiv, abs/1909.11652, 2019.
- Yuxiang Yang, Ken Caluwaerts, Atil Iscen, Tingnan Zhang, Jie Tan, and Vikas Sindhwani. Data efficient reinforcement learning for legged robots. ArXiv, abs/1907.03613, 2019.
- Arun Venkatraman, Martial Hebert, and J. Andrew Bagnell. Improving multi-step prediction of learned time series models. In AAAI, 2015.
- Samy Bengio, Oriol Vinyals, Navdeep Jaitly, and Noam Shazeer. Scheduled sampling for sequence prediction with recurrent neural networks. ArXiv, abs/1506.03099, 2015.
- Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, and Illia Polosukhin. Attention is all you need. ArXiv, abs/1706.03762, 2017.
- Ian Osband, John Aslanides, and Albin Cassirer. Randomized prior functions for deep reinforcement learning. CoRR, abs/1806.03335, 2018.
- Kamyar Azizzadenesheli, Emma Brunskill, and Animashree Anandkumar. Efficient exploration through bayesian deep q-networks. In ITA, pages 1–9. IEEE, 2018.
- Yuri Burda, Harrison Edwards, Amos J. Storkey, and Oleg Klimov. Exploration by random network distillation. In ICLR. OpenReview.net, 2019.
- Aravind Rajeswaran, Vikash Kumar, Abhishek Gupta, Giulia Vezzani, John Schulman, Emanuel Todorov, and Sergey Levine. Learning Complex Dexterous Manipulation with Deep Reinforcement Learning and Demonstrations. In Proceedings of Robotics: Science and Systems (RSS), 2018.
- Greg Brockman, Vicki Cheung, Ludwig Pettersson, Jonas Schneider, John Schulman, Jie Tang, and Wojciech Zaremba. Openai gym, 2016.
- Emanuel Todorov, Tom Erez, and Yuval Tassa. Mujoco: A physics engine for model-based control. In IROS, pages 5026–5033. IEEE, 2012.
- Diederik P. Kingma and Jimmy Ba. Adam: A method for stochastic optimization. In Yoshua Bengio and Yann LeCun, editors, ICLR, 2015.
- Thanard Kurutach, Ignasi Clavera, Yan Duan, Aviv Tamar, and Pieter Abbeel. Model-ensemble trust-region policy optimization. In ICLR. OpenReview.net, 2018.
- Anusha Nagabandi, Gregory Kahn, Ronald S. Fearing, and Sergey Levine. Neural network dynamics for model-based deep reinforcement learning with model-free fine-tuning. In ICRA. IEEE, 2018.
- Yuping Luo, Huazhe Xu, Yuanzhi Li, Yuandong Tian, Trevor Darrell, and Tengyu Ma. Algorithmic framework for model-based deep reinforcement learning with theoretical guarantees. In ICLR. OpenReview.net, 2019.
