RL Unplugged: A Collection of Benchmarks for Offline Reinforcement Learning

Thomas Paine
Sergio Gómez
Rishabh Agarwal
Cosmin Paduraru
Jerry Li

NeurIPS, 2020.

Keywords:
real-world RL, real-world application, benchmark suite, RL method, control suite (11+ more)

Abstract:

Offline methods for reinforcement learning have the potential to help bridge the gap between reinforcement learning research and real-world applications. They make it possible to learn policies from offline datasets, thus overcoming concerns associated with online data collection in the real world, including cost, safety, or ethical concerns…

Introduction
  • Reinforcement Learning (RL) has seen important breakthroughs, including learning directly from raw sensory streams [Mnih et al., 2015], solving long-horizon reasoning problems such as Go [Silver et al., 2016], StarCraft II [Vinyals et al., 2019], and Dota 2 [Berner et al., 2019], and learning motor control for high-dimensional simulated robots [Heess et al., 2017, Akkaya et al., 2019]
  • Many of these successes rely heavily on repeated online interactions of an agent with an environment.
Highlights
  • Reinforcement Learning (RL) has seen important breakthroughs, including learning directly from raw sensory streams [Mnih et al., 2015], solving long-horizon reasoning problems such as Go [Silver et al., 2016], StarCraft II [Vinyals et al., 2019], and Dota 2 [Berner et al., 2019], and learning motor control for high-dimensional simulated robots [Heess et al., 2017, Akkaya et al., 2019]
  • Ideally, offline RL would evaluate policies obtained with different hyperparameters using only logged data, for example using offline policy evaluation (OPE) methods [Voloshin et al., 2019]
  • We are releasing RL Unplugged, a suite of benchmarks covering a diverse set of environments, together with datasets exposed through an easy-to-use unified API (a data-handling sketch follows this list)
  • We empirically evaluate several state-of-the-art offline RL methods and analyze their results on our benchmark suite
  • The performance of the offline RL methods is already promising on some control suite tasks and Atari games
  • We intend to extend our benchmark suite with new environments and datasets from the community to close the gap between real-world applications and reinforcement learning research
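To make the data-handling point above concrete, here is a minimal sketch of the episode-to-transition view that most offline RL agents train on. The loader-free setup, the field names, the shapes, and the synthetic episode below are illustrative assumptions, not the actual RL Unplugged API or data format.

```python
# Illustrative sketch only: field names and shapes are assumptions, not the
# actual RL Unplugged data format.
from typing import Dict, Iterator, Tuple
import numpy as np

Episode = Dict[str, np.ndarray]
Transition = Tuple[np.ndarray, np.ndarray, float, float, np.ndarray]


def to_transitions(episode: Episode) -> Iterator[Transition]:
    """Converts one logged episode into (s, a, r, discount, s') tuples."""
    obs, act = episode["observation"], episode["action"]
    rew, disc = episode["reward"], episode["discount"]
    for t in range(len(act) - 1):
        yield obs[t], act[t], float(rew[t]), float(disc[t]), obs[t + 1]


# Synthetic stand-in for one logged episode, only to exercise the converter.
fake_episode: Episode = {
    "observation": np.random.randn(10, 4).astype(np.float32),
    "action": np.random.randn(10, 2).astype(np.float32),
    "reward": np.random.randn(10).astype(np.float32),
    "discount": np.ones(10, dtype=np.float32),
}
buffer = list(to_transitions(fake_episode))
print(f"{len(buffer)} transitions extracted")  # 9 transitions
```

Whatever the concrete dataset reader looks like, an off-policy learner typically consumes such (s, a, r, discount, s') tuples sampled from the logged episodes.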
Results
  • In a strict offline setting, environment interactions are not allowed
  • This makes hyperparameter tuning, including determining when to stop a training procedure, difficult.
  • Ideally, offline RL would evaluate policies obtained with different hyperparameters using only logged data, for example using offline policy evaluation (OPE) methods [Voloshin et al., 2019]; a toy OPE estimator is sketched after this list.
  • It is unclear whether current OPE methods scale well to difficult problems.
  • In RL Unplugged, the authors would like to evaluate offline RL performance under both the online and the offline policy selection settings.
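As a concrete reference point for offline policy selection, below is a minimal sketch of one classical OPE estimator, per-trajectory importance sampling, which scores a target policy using only logged rewards and the two policies' action probabilities. This is an illustrative baseline, not the evaluation protocol used in the paper; the function name and the toy data are assumptions.

```python
# Minimal per-trajectory importance-sampling OPE sketch (illustrative only).
import numpy as np


def importance_sampling_ope(episode_rewards, target_probs, behavior_probs, gamma=0.99):
    """Estimates the target policy's discounted return from logged episodes.

    episode_rewards: list of per-step reward arrays, one per logged episode.
    target_probs:    list of arrays with pi(a_t | s_t) under the policy being scored.
    behavior_probs:  list of arrays with mu(a_t | s_t) under the logging policy.
    """
    estimates = []
    for rewards, pi, mu in zip(episode_rewards, target_probs, behavior_probs):
        weight = np.prod(pi / mu)                                  # trajectory importance weight
        ret = np.sum(rewards * gamma ** np.arange(len(rewards)))   # discounted return
        estimates.append(weight * ret)
    return float(np.mean(estimates))


# Toy usage on two made-up logged episodes.
rewards = [np.array([1.0, 0.0, 1.0]), np.array([0.0, 1.0])]
pi_probs = [np.array([0.9, 0.8, 0.9]), np.array([0.7, 0.9])]
mu_probs = [np.array([0.5, 0.5, 0.5]), np.array([0.5, 0.5])]
print(importance_sampling_ope(rewards, pi_probs, mu_probs))
```

The high variance of estimators like this one is part of why scaling OPE to difficult problems is unclear, which motivates also evaluating under online policy selection.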
Conclusion
  • The authors are releasing RL Unplugged, a suite of benchmarks covering a diverse set of environments, and datasets with an easy-to-use unified API.
  • The authors empirically evaluate several state-of-the-art offline RL methods and analyze their results on the benchmark suite.
  • The performance of the offline RL methods is already promising on some control suite tasks and Atari games.
  • On partially observable environments such as the locomotion suite, the offline RL methods’ performance is lower.
  • The authors intend to extend the benchmark suite with new environments and datasets from the community to close the gap between real-world applications and reinforcement learning research.
Summary
  • Introduction:

    Reinforcement Learning (RL) has seen important breakthroughs, including learning directly from raw sensory streams [Mnih et al., 2015], solving long-horizon reasoning problems such as Go [Silver et al., 2016], StarCraft II [Vinyals et al., 2019], and Dota 2 [Berner et al., 2019], and learning motor control for high-dimensional simulated robots [Heess et al., 2017, Akkaya et al., 2019]
  • Many of these successes rely heavily on repeated online interactions of an agent with an environment.
  • Objectives:

    This paper aims to address the lack of common benchmarks for offline RL, so as to facilitate collaborative research and measurable progress in the field.
  • Results:

    In a strict offline setting, environment interactions are not allowed
  • This makes hyperparameter tuning, including determining when to stop a training procedure, difficult.
  • Ideally, offline RL would evaluate policies obtained with different hyperparameters using only logged data, for example using offline policy evaluation (OPE) methods [Voloshin et al., 2019].
  • It is unclear whether current OPE methods scale well to difficult problems.
  • In RL Unplugged, the authors would like to evaluate offline RL performance under both the online and the offline policy selection settings.
  • Conclusion:

    The authors are releasing RL Unplugged, a suite of benchmarks covering a diverse set of environments, and datasets with an easy-to-use unified API.
  • The authors empirically evaluate several state-of-the-art offline RL methods and analyze their results on the benchmark suite.
  • The performance of the offline RL methods is already promising on some control suite tasks and Atari games.
  • On partially observable environments such as the locomotion suite, the offline RL methods’ performance is lower.
  • The authors intend to extend the benchmark suite with new environments and datasets from the community to close the gap between real-world applications and reinforcement learning research.
Tables
  • Table 1: DM Control Suite tasks. We reserved five tasks for online policy selection (top) and the remaining four for offline policy selection (bottom). See Appendix E for the reasoning behind this particular task split
  • Table 2: DM Locomotion tasks. We reserved four tasks for online policy selection (top) and the remaining three for offline policy selection (bottom). See Appendix E for the reasoning behind this particular task split
  • Table 3: Atari games. We have 46 games in total in our Atari data release. We reserved 9 of the games for online policy selection (top) and the remaining 37 games for offline policy selection (bottom)
Related work
  • There is a large body of work focused on developing novel offline reinforcement learning algorithms [Fujimoto et al., 2018, Wu et al., 2019, Agarwal et al., 2020, Siegel et al., 2020]. These works have often tested their methods on simple MDPs such as grid worlds [Laroche et al., 2017], or on fully observed environments where the state of the world is given [Fujimoto et al., 2018, Wu et al., 2019, Fu et al., 2020]. There has also been extensive work applying offline reinforcement learning to difficult real-world domains such as robotics [Cabi et al., 2019, Gu et al., 2017, Kalashnikov et al., 2018] or dialog [Henderson et al., 2008, Pietquin et al., 2011, Jaques et al., 2019], but it is often difficult to do thorough evaluations in these domains for the same reason offline RL is useful in them, namely that interaction with the environment is costly. Additionally, without consistent environments and datasets, it is impossible to clearly compare these different algorithmic approaches. We instead focus on a range of challenging simulated environments and on establishing them as a benchmark for offline RL algorithms. Two prior works are similar in that regard. The first is [Agarwal et al., 2020], which releases the DQN Replay dataset for Atari 2600 games, a challenging and well-known RL benchmark; we have reached out to the authors to include this dataset as part of our benchmark. The second is [Fu et al., 2020], which released datasets for a range of control tasks, including the Control Suite, and dexterous manipulation tasks. Unlike our benchmark, which includes tasks that test memory and representation learning, their tasks are all from fully observable MDPs, where the physical state information is explicitly provided.
Funding
  • OpenAI Gym implements more than 46 Atari games, but we only include games where the online DQN that generated the dataset performed significantly better than a random policy (a toy version of this inclusion check is sketched below)
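A toy version of that inclusion check, with made-up scores and an arbitrary margin rather than the paper's actual criterion:

```python
# Toy game-inclusion filter: keep an Atari game only if the data-generating
# online DQN clearly beats a random policy. Margin and scores are made up.

def significantly_better(dqn_score: float, random_score: float,
                         margin: float = 5.0) -> bool:
    """Illustrative check: require DQN to beat random by at least `margin` points."""
    return dqn_score - random_score >= margin


# Made-up per-game scores: (DQN score, random score).
scores = {"GameA": (120.0, 15.0), "GameB": (17.0, 16.0)}
kept = [game for game, (dqn, rnd) in scores.items() if significantly_better(dqn, rnd)]
print(kept)  # ['GameA']
```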
Study subjects and analysis
datasets: 3
These policies were obtained by training 3 seeds of distributional MPO [Abdolmaleki et al., 2018] until convergence with different random weight initializations, and then taking snapshots corresponding to roughly 75% of the converged performance. For the no-challenge setting, three datasets of different sizes were generated for each environment by combining the three snapshots, with the total dataset sizes (in numbers of episodes) provided in Table 4. The procedure was repeated for the easy combined challenge setting.
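As a rough sketch of that generation recipe, the snippet below pools episodes collected from several snapshot policies into a single dataset. The policies, the episode-collection function, and the episode counts are placeholders, not the authors' actual pipeline.

```python
# Illustrative-only sketch: roll out snapshot policies (one per seed, taken at
# ~75% of converged performance) and pool their episodes into one dataset.
import random
from typing import Callable, Dict, List


def build_dataset(policies: List[Callable], run_episode: Callable[[Callable], Dict],
                  num_episodes: int, seed: int = 0) -> List[Dict]:
    """Pools episodes from several snapshot policies into one offline dataset."""
    rng = random.Random(seed)
    episodes = []
    for _ in range(num_episodes):
        policy = rng.choice(policies)          # combine the snapshot policies
        episodes.append(run_episode(policy))   # roll out one episode and log it
    return episodes


# Tiny runnable demo with dummy stand-ins for the three snapshots.
dummy_policies = [lambda obs: 0, lambda obs: 1, lambda obs: 2]
dataset = build_dataset(dummy_policies,
                        run_episode=lambda pi: {"first_action": pi(None)},
                        num_episodes=5)
print(len(dataset))  # 5
```

Sampling a snapshot uniformly per episode is just one simple way to "combine" them; the resulting dataset sizes are the ones the summary says are listed in Table 4.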

Reference
  • A. Abdolmaleki, J. T. Springenberg, Y. Tassa, R. Munos, N. Heess, and M. A. Riedmiller. Maximum a posteriori policy optimisation. In International Conference on Learning Representations (ICLR), 2018.
  • R. Agarwal, D. Schuurmans, and M. Norouzi. An optimistic perspective on offline reinforcement learning. In International Conference on Machine Learning, 2020.
  • I. Akkaya, M. Andrychowicz, M. Chociej, M. Litwin, B. McGrew, A. Petron, A. Paino, M. Plappert, G. Powell, R. Ribas, et al. Solving Rubik's Cube with a robot hand. arXiv preprint arXiv:1910.07113, 2019.
  • G. Barth-Maron, M. W. Hoffman, D. Budden, W. Dabney, D. Horgan, D. TB, A. Muldal, N. Heess, and T. Lillicrap. Distributed distributional deterministic policy gradients. arXiv preprint arXiv:1804.08617, 2018.
  • M. G. Bellemare, Y. Naddaf, J. Veness, and M. Bowling. The arcade learning environment: An evaluation platform for general agents. Journal of Artificial Intelligence Research, 47:253–279, 2013.
  • C. Berner, G. Brockman, B. Chan, V. Cheung, P. Debiak, C. Dennison, D. Farhi, Q. Fischer, S. Hashme, C. Hesse, et al. Dota 2 with large scale deep reinforcement learning. arXiv preprint arXiv:1912.06680, 2019.
  • S. Cabi, S. G. Colmenarejo, A. Novikov, K. Konyushkova, S. Reed, R. Jeong, K. Zołna, Y. Aytar, D. Budden, M. Vecerik, O. Sushkov, D. Barker, J. Scholz, M. Denil, N. de Freitas, and Z. Wang. Scaling data-driven robotics with reward sketching and batch reinforcement learning. arXiv preprint arXiv:1909.12200, 2019.
  • W. Dabney, G. Ostrovski, D. Silver, and R. Munos. Implicit quantile networks for distributional reinforcement learning. arXiv preprint arXiv:1806.06923, 2018.
  • J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei. ImageNet: A large-scale hierarchical image database. In 2009 IEEE Conference on Computer Vision and Pattern Recognition, pages 248–255. IEEE, 2009.
  • G. Dulac-Arnold, D. Mankowitz, and T. Hester. Challenges of real-world reinforcement learning, 2019.
  • G. Dulac-Arnold, N. Levine, D. J. Mankowitz, J. Li, C. Paduraru, S. Gowal, and T. Hester. An empirical investigation of the challenges of real-world reinforcement learning, 2020.
  • J. Fu, A. Kumar, O. Nachum, G. Tucker, and S. Levine. D4RL: Datasets for deep data-driven reinforcement learning, 2020.
  • S. Fujimoto, D. Meger, and D. Precup. Off-policy deep reinforcement learning without exploration. arXiv preprint arXiv:1812.02900, 2018.
  • S. Fujimoto, E. Conti, M. Ghavamzadeh, and J. Pineau. Benchmarking batch deep reinforcement learning algorithms. arXiv preprint arXiv:1910.01708, 2019.
  • S. Gu, E. Holly, T. Lillicrap, and S. Levine. Deep reinforcement learning for robotic manipulation with asynchronous off-policy updates. In 2017 IEEE International Conference on Robotics and Automation (ICRA), pages 3389–3396. IEEE, 2017.
  • T. Haarnoja, A. Zhou, P. Abbeel, and S. Levine. Soft actor-critic: Off-policy maximum entropy deep reinforcement learning with a stochastic actor. arXiv preprint arXiv:1801.01290, 2018.
  • N. Heess, D. TB, S. Sriram, J. Lemmon, J. Merel, G. Wayne, Y. Tassa, T. Erez, Z. Wang, S. M. A. Eslami, M. Riedmiller, and D. Silver. Emergence of locomotion behaviours in rich environments, 2017.
  • J. Henderson, O. Lemon, and K. Georgila. Hybrid reinforcement/supervised learning of dialogue policies from fixed data sets. Computational Linguistics, 34(4):487–511, 2008.
  • P. Henderson, R. Islam, P. Bachman, J. Pineau, D. Precup, and D. Meger. Deep reinforcement learning that matters. In Thirty-Second AAAI Conference on Artificial Intelligence, 2018.
  • M. Hessel, J. Modayil, H. Van Hasselt, T. Schaul, G. Ostrovski, W. Dabney, D. Horgan, B. Piot, M. Azar, and D. Silver. Rainbow: Combining improvements in deep reinforcement learning. In Thirty-Second AAAI Conference on Artificial Intelligence, 2018.
  • M. Hoffman, B. Shahriari, J. Aslanides, G. Barth-Maron, F. Behbahani, T. Norman, A. Abdolmaleki, A. Cassirer, F. Yang, K. Baumli, S. Henderson, A. Novikov, S. G. Colmenarejo, S. Cabi, C. Gulcehre, T. L. Paine, A. Cowie, Z. Wang, B. Piot, and N. de Freitas. Acme: A research framework for distributed reinforcement learning. arXiv preprint arXiv:2006.00979, 2020.
  • N. Jaques, A. Ghandeharioun, J. H. Shen, C. Ferguson, A. Lapedriza, N. Jones, S. Gu, and R. Picard. Way off-policy batch deep reinforcement learning of implicit human preferences in dialog. arXiv preprint arXiv:1907.00456, 2019.
  • D. Kalashnikov, A. Irpan, P. Pastor, J. Ibarz, A. Herzog, E. Jang, D. Quillen, E. Holly, M. Kalakrishnan, V. Vanhoucke, et al. QT-Opt: Scalable deep reinforcement learning for vision-based robotic manipulation. arXiv preprint arXiv:1806.10293, 2018.
  • D. P. Kingma and J. Ba. Adam: A method for stochastic optimization. In International Conference on Learning Representations (ICLR), 2015.
  • R. Laroche, P. Trichelair, and R. T. d. Combes. Safe policy improvement with baseline bootstrapping. arXiv preprint arXiv:1712.06924, 2017.
  • M. C. Machado, M. G. Bellemare, E. Talvitie, J. Veness, M. Hausknecht, and M. Bowling. Revisiting the arcade learning environment: Evaluation protocols and open problems for general agents. Journal of Artificial Intelligence Research, 61:523–562, 2018.
  • J. Merel, A. Ahuja, V. Pham, S. Tunyasuvunakool, S. Liu, D. Tirumala, N. Heess, and G. Wayne. Hierarchical visuomotor control of humanoids. In International Conference on Learning Representations, 2019a.
  • J. Merel, L. Hasenclever, A. Galashov, A. Ahuja, V. Pham, G. Wayne, Y. W. Teh, and N. Heess. Neural probabilistic motor primitives for humanoid control. In International Conference on Learning Representations, 2019b.
  • J. Merel, D. Aldarondo, J. Marshall, Y. Tassa, G. Wayne, and B. Ölveczky. Deep neuroethology of a virtual rodent. In International Conference on Learning Representations, 2020.
  • V. Mnih, K. Kavukcuoglu, D. Silver, A. Graves, I. Antonoglou, D. Wierstra, and M. Riedmiller. Playing Atari with deep reinforcement learning. arXiv preprint arXiv:1312.5602, 2013.
  • V. Mnih, K. Kavukcuoglu, D. Silver, A. A. Rusu, J. Veness, M. G. Bellemare, A. Graves, M. Riedmiller, A. K. Fidjeland, G. Ostrovski, et al. Human-level control through deep reinforcement learning. Nature, 518(7540):529–533, 2015.
  • X. B. Peng, A. Kumar, G. Zhang, and S. Levine. Advantage-weighted regression: Simple and scalable off-policy reinforcement learning. arXiv preprint arXiv:1910.00177, 2019.
  • O. Pietquin, M. Geist, S. Chandramohan, and H. Frezza-Buet. Sample-efficient batch reinforcement learning for dialogue management optimization. ACM Transactions on Speech and Language Processing (TSLP), 7(3):1–21, 2011.
  • D. A. Pomerleau. ALVINN: An autonomous land vehicle in a neural network. In Advances in Neural Information Processing Systems, pages 305–313, 1989.
  • N. Y. Siegel, J. T. Springenberg, F. Berkenkamp, A. Abdolmaleki, M. Neunert, T. Lampe, R. Hafner, and M. Riedmiller. Keep doing what worked: Behavioral modelling priors for offline reinforcement learning, 2020.
  • D. Silver, A. Huang, C. J. Maddison, A. Guez, L. Sifre, G. Van Den Driessche, J. Schrittwieser, I. Antonoglou, V. Panneershelvam, M. Lanctot, et al. Mastering the game of Go with deep neural networks and tree search. Nature, 529(7587):484–489, 2016.
  • H. F. Song, A. Abdolmaleki, J. T. Springenberg, A. Clark, H. Soyer, J. W. Rae, S. Noury, A. Ahuja, S. Liu, D. Tirumala, et al. V-MPO: On-policy maximum a posteriori policy optimization for discrete and continuous control. In International Conference on Learning Representations, 2020.
  • Y. Tassa, Y. Doron, A. Muldal, T. Erez, Y. Li, D. de Las Casas, D. Budden, A. Abdolmaleki, J. Merel, A. Lefrancq, T. P. Lillicrap, and M. A. Riedmiller. DeepMind Control Suite. arXiv preprint arXiv:1801.00690, 2018. URL http://arxiv.org/abs/1801.00690.
  • Y. Tassa, S. Tunyasuvunakool, A. Muldal, Y. Doron, S. Liu, S. Bohez, J. Merel, T. Erez, T. P. Lillicrap, and N. Heess. dm_control: Software and tasks for continuous control. arXiv preprint arXiv:2006.12983, 2020.
  • E. Todorov, T. Erez, and Y. Tassa. MuJoCo: A physics engine for model-based control. In 2012 IEEE/RSJ International Conference on Intelligent Robots and Systems, pages 5026–5033. IEEE, 2012.
  • H. Van Hasselt, A. Guez, and D. Silver. Deep reinforcement learning with double Q-learning. In Thirtieth AAAI Conference on Artificial Intelligence, 2016.
  • A. Veit, T. Matera, L. Neumann, J. Matas, and S. Belongie. COCO-Text: Dataset and benchmark for text detection and recognition in natural images. arXiv preprint arXiv:1601.07140, 2016.
  • O. Vinyals, I. Babuschkin, W. M. Czarnecki, M. Mathieu, A. Dudzik, J. Chung, D. H. Choi, R. Powell, T. Ewalds, P. Georgiev, et al. Grandmaster level in StarCraft II using multi-agent reinforcement learning. Nature, 575(7782):350–354, 2019.
  • C. Voloshin, H. M. Le, N. Jiang, and Y. Yue. Empirical study of off-policy policy evaluation for reinforcement learning. arXiv preprint arXiv:1911.06854, 2019.
  • Y. Wu, G. Tucker, and O. Nachum. Behavior regularized offline reinforcement learning. arXiv preprint arXiv:1911.11361, 2019.