RL Unplugged: A Collection of Benchmarks for Offline Reinforcement Learning
NeurIPS, 2020.
Abstract:
Offline methods for reinforcement learning have the potential to help bridge the gap between reinforcement learning research and real-world applications. They make it possible to learn policies from offline datasets, thus overcoming concerns associated with online data collection in the real world, including cost, safety, or ethical concerns.
Introduction
- Reinforcement Learning (RL) has seen important breakthroughs, including learning directly from raw sensory streams [Mnih et al, 2015], solving long-horizon reasoning problems such as Go [Silver et al, 2016], StarCraft II [Vinyals et al, 2019], DOTA [Berner et al, 2019], and learning motor control for high-dimensional simulated robots [Heess et al, 2017, Akkaya et al, 2019]
- Many of these successes rely heavily on repeated online interactions of an agent with an environment.
Highlights
- Reinforcement Learning (RL) has seen important breakthroughs, including learning directly from raw sensory streams [Mnih et al, 2015], solving long-horizon reasoning problems such as Go [Silver et al, 2016], StarCraft II [Vinyals et al, 2019], DOTA [Berner et al, 2019], and learning motor control for high-dimensional simulated robots [Heess et al, 2017, Akkaya et al, 2019]
- In a strictly offline setting, policies obtained with different hyperparameters would have to be evaluated using only logged data, for example with offline policy evaluation (OPE) methods [Voloshin et al, 2019]
- We are releasing RL Unplugged, a suite of benchmarks covering a diverse set of environments, and datasets with an easy-to-use unified API (a data-loading sketch follows this list)
- We empirically evaluate several state-of-the-art offline RL methods and analyze their results on our benchmark suite
- The performance of the offline RL methods is already promising on some control suite tasks and Atari games
- We intend to extend our benchmark suite with new environments and datasets from the community to close the gap between real-world applications and reinforcement learning research
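The unified API mentioned above can be pictured as a standard data pipeline over the released shards. The sketch below is only an illustration of such a reader, assuming the transitions are stored as TFRecord shards; the file pattern, feature names, and the `load_transitions` helper are hypothetical and not the benchmark's released API.

```python
# Hypothetical sketch of a reader for an offline RL dataset stored as
# TFRecord shards of (s, a, r, discount, s') transitions. Feature names and
# paths are illustrative; see the RL Unplugged repository for the real readers.
import tensorflow as tf

FEATURES = {
    'observation': tf.io.FixedLenFeature([], tf.string),        # encoded observation
    'action': tf.io.FixedLenFeature([], tf.int64),
    'reward': tf.io.FixedLenFeature([], tf.float32),
    'discount': tf.io.FixedLenFeature([], tf.float32),
    'next_observation': tf.io.FixedLenFeature([], tf.string),
}

def parse_transition(record):
    """Decodes one serialized transition into a dict of tensors."""
    return tf.io.parse_single_example(record, FEATURES)

def load_transitions(shard_pattern, batch_size=256):
    """Builds a shuffled, batched tf.data pipeline over TFRecord shards."""
    files = tf.data.Dataset.list_files(shard_pattern)
    ds = files.interleave(tf.data.TFRecordDataset, cycle_length=4,
                          num_parallel_calls=tf.data.AUTOTUNE)
    ds = ds.map(parse_transition, num_parallel_calls=tf.data.AUTOTUNE)
    return ds.shuffle(10_000).batch(batch_size).prefetch(tf.data.AUTOTUNE)

# Usage (illustrative path):
# for batch in load_transitions('/path/to/atari_pong/run_1-*.tfrecord'):
#     learner_step(batch)
```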
Results
- In a strict offline setting, environment interactions are not allowed
- This makes hyperparameter tuning, including determining when to stop a training procedure, difficult.
- In a strictly offline setting, policies obtained with different hyperparameters would have to be evaluated using only logged data, for example with offline policy evaluation (OPE) methods [Voloshin et al, 2019] (a minimal OPE sketch follows this list).
- It is unclear whether current OPE methods scale well to difficult problems.
- In RL Unplugged, the authors evaluate offline RL performance in both settings, i.e., with online policy selection and with purely offline policy selection
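To make the OPE point concrete, below is a minimal sketch of one classical OPE estimator, trajectory-wise importance sampling. It is not a method from the paper; the data layout (each logged trajectory carrying its rewards and the behaviour policy's per-step action probabilities) is an assumption for illustration.

```python
# Minimal sketch of trajectory-wise importance sampling for offline policy
# evaluation. Assumes logged behaviour-policy probabilities are available;
# this is an illustrative baseline, not the benchmark's evaluation protocol.
import numpy as np

def importance_sampling_value(trajectories, target_probs, gamma=0.99):
    """Estimates the target policy's expected return from logged data.

    trajectories: list of dicts with 'rewards' and 'behavior_probs' arrays.
    target_probs: per-trajectory arrays of pi(a_t | s_t) under the target
        policy, aligned with the logged actions.
    """
    estimates = []
    for traj, pi in zip(trajectories, target_probs):
        mu = np.asarray(traj['behavior_probs'], dtype=float)
        rewards = np.asarray(traj['rewards'], dtype=float)
        weight = np.prod(np.asarray(pi, dtype=float) / mu)   # product of pi/mu ratios
        discounts = gamma ** np.arange(len(rewards))
        estimates.append(weight * np.sum(discounts * rewards))
    return float(np.mean(estimates))
```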
Conclusion
- The authors are releasing RL Unplugged, a suite of benchmarks covering a diverse set of environments, and datasets with an easy-to-use unified API.
- The authors empirically evaluate several state-of-the-art offline RL methods and analyze their results on the benchmark suite.
- The performance of the offline RL methods is already promising on some control suite tasks and Atari games.
- In partially observable environments such as the locomotion suite, the offline RL methods’ performance is lower.
- The authors intend to extend the benchmark suite with new environments and datasets from the community to close the gap between real-world applications and reinforcement learning research
Summary
Introduction:
Reinforcement Learning (RL) has seen important breakthroughs, including learning directly from raw sensory streams [Mnih et al, 2015], solving long-horizon reasoning problems such as Go [Silver et al, 2016], StarCraft II [Vinyals et al, 2019], DOTA [Berner et al, 2019], and learning motor control for high-dimensional simulated robots [Heess et al, 2017, Akkaya et al, 2019].
- Many of these successes rely heavily on repeated online interactions of an agent with an environment.
Objectives:
This paper aims to correct this so as to facilitate collaborative research and measurable progress in the field.
Results:
In a strict offline setting, environment interactions are not allowed.
- This makes hyperparameter tuning, including determining when to stop a training procedure, difficult.
- In a strictly offline setting, policies obtained with different hyperparameters would have to be evaluated using only logged data, for example with offline policy evaluation (OPE) methods [Voloshin et al, 2019].
- It is unclear whether current OPE methods scale well to difficult problems.
- In RL Unplugged, the authors evaluate offline RL performance in both settings, i.e., with online policy selection and with purely offline policy selection
Conclusion:
The authors are releasing RL Unplugged, a suite of benchmarks covering a diverse set of environments, and datasets with an easy-to-use unified API.
- The authors empirically evaluate several state-of-the-art offline RL methods and analyze their results on the benchmark suite.
- The performance of the offline RL methods is already promising on some control suite tasks and Atari games.
- In partially observable environments such as the locomotion suite, the offline RL methods’ performance is lower.
- The authors intend to extend the benchmark suite with new environments and datasets from the community to close the gap between real-world applications and reinforcement learning research
Tables
- Table1: DM Control Suite tasks. We reserved five tasks for online policy selection (top) and the remaining four for offline policy selection (bottom). See Appendix E for the reasoning behind this particular task split
- Table2: DM Locomotion tasks. We reserved four tasks for online policy selection (top) and the remaining three for offline policy selection (bottom). See Appendix E for the reasoning behind this particular task split
- Table3: Atari games. We have 46 games in total in our Atari data release. We reserved 9 of the games for online policy selection (top) and the remaining 37 games for offline policy selection (bottom)
Related work
- There is a large body of work focused on developing novel offline reinforcement learning algorithms [Fujimoto et al, 2018, Wu et al, 2019, Agarwal et al, 2020, Siegel et al, 2020]. These works have often tested their methods on simple MDPs such as grid worlds [Laroche et al, 2017], or on fully observed environments where the state of the world is given [Fujimoto et al, 2018, Wu et al, 2019, Fu et al, 2020]. There has also been extensive work applying offline reinforcement learning to difficult real-world domains such as robotics [Cabi et al, 2019, Gu et al, 2017, Kalashnikov et al, 2018] or dialog [Henderson et al, 2008, Pietquin et al, 2011, Jaques et al, 2019], but it is often difficult to do thorough evaluations in these domains for the same reason offline RL is useful in them, namely that interaction with the environment is costly. Additionally, without consistent environments and datasets, it is impossible to clearly compare these different algorithmic approaches. We instead focus on a range of challenging simulated environments and on establishing them as a benchmark for offline RL algorithms. Two prior works are similar in that regard. The first is [Agarwal et al, 2020], which releases the DQN Replay dataset for Atari 2600 games, a challenging and well-known RL benchmark; we have reached out to the authors to include this dataset as part of our benchmark. The second is [Fu et al, 2020], which releases datasets for a range of control tasks, including the Control Suite, and dexterous manipulation tasks. Unlike our benchmark, which includes tasks that test memory and representation learning, their tasks are all from fully observable MDPs, where the physical state information is explicitly provided.
Funding
- OpenAI gym implements more than 46 games, but we only include games where the performance of the online DQN that generated the dataset was significantly better than that of a random policy (a selection sketch follows)
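As an illustration only of this selection rule, one could keep a game when the data-generating DQN's return beats a random-policy baseline by some margin; the scores and the margin below are made up, not values from the paper.

```python
# Illustrative filter: keep games where the online DQN that generated the
# dataset clearly outperforms a random policy. The margin is a hypothetical
# placeholder, not the threshold used in the paper.
def select_games(dqn_returns, random_returns, margin=0.0):
    """dqn_returns / random_returns: dicts mapping game name -> average return."""
    return [game for game, score in dqn_returns.items()
            if score - random_returns[game] > margin]

# Example with made-up scores: Pong passes the filter, Tennis does not.
print(select_games({'Pong': 19.0, 'Tennis': -23.5},
                   {'Pong': -20.7, 'Tennis': -23.8}, margin=1.0))
# -> ['Pong']
```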
Study subjects and analysis
datasets: 3
These policies were obtained by training 3 seeds of distributional MPO [Abdolmaleki et al, 2018] until convergence with different random weight initializations, and then taking snapshots corresponding to roughly 75% of the converged performance (a snapshot-selection sketch follows). For the no challenge setting, three datasets of different sizes were generated for each environment by combining the three snapshots, with the total dataset sizes (in numbers of episodes) provided in Table 4. The procedure was repeated for the easy combined challenge setting.
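The 75%-of-converged-performance rule can be pictured with a small sketch. The layout of the evaluation curve and the use of the final snapshot's return as the "converged" reference are assumptions for illustration, not the paper's exact procedure.

```python
# Hedged sketch: pick the earliest snapshot of a training run whose average
# evaluation return reaches roughly 75% of the converged (final) performance.
import numpy as np

def pick_snapshot(eval_returns, fraction=0.75):
    """eval_returns: per-snapshot average returns along one training run."""
    returns = np.asarray(eval_returns, dtype=float)
    target = fraction * returns[-1]            # final return as the converged level
    above = np.nonzero(returns >= target)[0]
    return int(above[0]) if len(above) else len(returns) - 1

# Example: pick_snapshot([10., 40., 80., 95., 100.]) -> 2 (return 80 >= 75)
```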
Reference
- A. Abdolmaleki, J. T. Springenberg, Y. Tassa, R. Munos, N. Heess, and M. A. Riedmiller. Maximum a posteriori policy optimisation. In International Conference on Learning Representations (ICLR), 2018.
- R. Agarwal, D. Schuurmans, and M. Norouzi. An optimistic perspective on offline reinforcement learning. In International Conference on Machine Learning, 2020.
- I. Akkaya, M. Andrychowicz, M. Chociej, M. Litwin, B. McGrew, A. Petron, A. Paino, M. Plappert, G. Powell, R. Ribas, et al. Solving rubik’s cube with a robot hand. arXiv preprint arXiv:1910.07113, 2019.
- G. Barth-Maron, M. W. Hoffman, D. Budden, W. Dabney, D. Horgan, D. Tb, A. Muldal, N. Heess, and T. Lillicrap. Distributed distributional deterministic policy gradients. arXiv preprint arXiv:1804.08617, 2018.
- M. G. Bellemare, Y. Naddaf, J. Veness, and M. Bowling. The arcade learning environment: An evaluation platform for general agents. Journal of Artificial Intelligence Research, 47:253–279, 2013.
- C. Berner, G. Brockman, B. Chan, V. Cheung, P. Debiak, C. Dennison, D. Farhi, Q. Fischer, S. Hashme, C. Hesse, et al. Dota 2 with large scale deep reinforcement learning. arXiv preprint arXiv:1912.06680, 2019.
- S. Cabi, S. G. Colmenarejo, A. Novikov, K. Konyushkova, S. Reed, R. Jeong, K. Zołna, Y. Aytar, D. Budden, M. Vecerik, O. Sushkov, D. Barker, J. Scholz, M. Denil, N. de Freitas, and Z. Wang. Scaling data-driven robotics with reward sketching and batch reinforcement learning. arXiv preprint arXiv:1909.12200, 2019.
- W. Dabney, G. Ostrovski, D. Silver, and R. Munos. Implicit quantile networks for distributional reinforcement learning. arXiv preprint arXiv:1806.06923, 2018.
- J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei. Imagenet: A large-scale hierarchical image database. In 2009 IEEE conference on computer vision and pattern recognition, pages 248–255.
- G. Dulac-Arnold, D. Mankowitz, and T. Hester. Challenges of real-world reinforcement learning, 2019.
- G. Dulac-Arnold, N. Levine, D. J. Mankowitz, J. Li, C. Paduraru, S. Gowal, and T. Hester. An empirical investigation of the challenges of real-world reinforcement learning. 2020.
- J. Fu, A. Kumar, O. Nachum, G. Tucker, and S. Levine. D4rl: Datasets for deep data-driven reinforcement learning, 2020.
- S. Fujimoto, D. Meger, and D. Precup. Off-policy deep reinforcement learning without exploration. arXiv preprint arXiv:1812.02900, 2018.
- S. Fujimoto, E. Conti, M. Ghavamzadeh, and J. Pineau. Benchmarking batch deep reinforcement learning algorithms. arXiv preprint arXiv:1910.01708, 2019.
- S. Gu, E. Holly, T. Lillicrap, and S. Levine. Deep reinforcement learning for robotic manipulation with asynchronous off-policy updates. In 2017 IEEE international conference on robotics and automation (ICRA), pages 3389–3396. IEEE, 2017.
- T. Haarnoja, A. Zhou, P. Abbeel, and S. Levine. Soft actor-critic: Off-policy maximum entropy deep reinforcement learning with a stochastic actor. arXiv preprint arXiv:1801.01290, 2018.
- N. Heess, D. TB, S. Sriram, J. Lemmon, J. Merel, G. Wayne, Y. Tassa, T. Erez, Z. Wang, S. M. A. Eslami, M. Riedmiller, and D. Silver. Emergence of locomotion behaviours in rich environments, 2017.
- J. Henderson, O. Lemon, and K. Georgila. Hybrid reinforcement/supervised learning of dialogue policies from fixed data sets. Computational Linguistics, 34(4):487–511, 2008.
- P. Henderson, R. Islam, P. Bachman, J. Pineau, D. Precup, and D. Meger. Deep reinforcement learning that matters. In Thirty-Second AAAI Conference on Artificial Intelligence, 2018.
- M. Hessel, J. Modayil, H. Van Hasselt, T. Schaul, G. Ostrovski, W. Dabney, D. Horgan, B. Piot, M. Azar, and D. Silver. Rainbow: Combining improvements in deep reinforcement learning. In Thirty-Second AAAI Conference on Artificial Intelligence, 2018.
- M. Hoffman, B. Shahriari, J. Aslanides, G. Barth-Maron, F. Behbahani, T. Norman, A. Abdolmaleki, A. Cassirer, F. Yang, K. Baumli, S. Henderson, A. Novikov, S. G. Colmenarejo, S. Cabi, C. Gulcehre, T. L. Paine, A. Cowie, Z. Wang, B. Piot, and N. de Freitas. Acme: A research framework for distributed reinforcement learning. Preprint arXiv:2006.00979, 2020.
- N. Jaques, A. Ghandeharioun, J. H. Shen, C. Ferguson, A. Lapedriza, N. Jones, S. Gu, and R. Picard. Way off-policy batch deep reinforcement learning of implicit human preferences in dialog. arXiv preprint arXiv:1907.00456, 2019.
- D. Kalashnikov, A. Irpan, P. Pastor, J. Ibarz, A. Herzog, E. Jang, D. Quillen, E. Holly, M. Kalakrishnan, V. Vanhoucke, et al. Qt-opt: Scalable deep reinforcement learning for vision-based robotic manipulation. arXiv preprint arXiv:1806.10293, 2018.
- D. P. Kingma and J. Ba. Adam: A method for stochastic optimization. 2015.
- R. Laroche, P. Trichelair, and R. T. d. Combes. Safe policy improvement with baseline bootstrapping. arXiv preprint arXiv:1712.06924, 2017.
- M. C. Machado, M. G. Bellemare, E. Talvitie, J. Veness, M. Hausknecht, and M. Bowling. Revisiting the arcade learning environment: Evaluation protocols and open problems for general agents. Journal of Artificial Intelligence Research, 61:523–562, 2018.
- J. Merel, A. Ahuja, V. Pham, S. Tunyasuvunakool, S. Liu, D. Tirumala, N. Heess, and G. Wayne. Hierarchical visuomotor control of humanoids. In International Conference on Learning Representations, 2019a.
- J. Merel, L. Hasenclever, A. Galashov, A. Ahuja, V. Pham, G. Wayne, Y. W. Teh, and N. Heess. Neural probabilistic motor primitives for humanoid control. In International Conference on Learning Representations, 2019b.
- J. Merel, D. Aldarondo, J. Marshall, Y. Tassa, G. Wayne, and B. Ölveczky. Deep neuroethology of a virtual rodent. In International Conference on Learning Representations, 2020.
- V. Mnih, K. Kavukcuoglu, D. Silver, A. Graves, I. Antonoglou, D. Wierstra, and M. Riedmiller. Playing atari with deep reinforcement learning. arXiv preprint arXiv:1312.5602, 2013.
- V. Mnih, K. Kavukcuoglu, D. Silver, A. A. Rusu, J. Veness, M. G. Bellemare, A. Graves, M. Riedmiller, A. K. Fidjeland, G. Ostrovski, et al. Human-level control through deep reinforcement learning. Nature, 518(7540):529–533, 2015.
- X. B. Peng, A. Kumar, G. Zhang, and S. Levine. Advantage-weighted regression: Simple and scalable off-policy reinforcement learning. arXiv preprint arXiv:1910.00177, 2019.
- O. Pietquin, M. Geist, S. Chandramohan, and H. Frezza-Buet. Sample-efficient batch reinforcement learning for dialogue management optimization. ACM Transactions on Speech and Language Processing (TSLP), 7(3):1–21, 2011.
- D. A. Pomerleau. Alvinn: An autonomous land vehicle in a neural network. In Advances in neural information processing systems, pages 305–313, 1989.
- N. Y. Siegel, J. T. Springenberg, F. Berkenkamp, A. Abdolmaleki, M. Neunert, T. Lampe, R. Hafner, and M. Riedmiller. Keep doing what worked: Behavioral modelling priors for offline reinforcement learning. 2020.
- D. Silver, A. Huang, C. J. Maddison, A. Guez, L. Sifre, G. Van Den Driessche, J. Schrittwieser, I. Antonoglou, V. Panneershelvam, M. Lanctot, et al. Mastering the game of go with deep neural networks and tree search. Nature, 529(7587):484–489, 2016.
- H. F. Song, A. Abdolmaleki, J. T. Springenberg, A. Clark, H. Soyer, J. W. Rae, S. Noury, A. Ahuja, S. Liu, D. Tirumala, et al. V-MPO: On-Policy Maximum a Posteriori Policy Optimization for Discrete and Continuous Control. In International Conference on Learning Representations, 2020.
- Y. Tassa, Y. Doron, A. Muldal, T. Erez, Y. Li, D. de Las Casas, D. Budden, A. Abdolmaleki, J. Merel, A. Lefrancq, T. P. Lillicrap, and M. A. Riedmiller. DeepMind Control Suite. CoRR, abs/1801.00690, 2018. URL http://arxiv.org/abs/1801.00690.
- Y. Tassa, S. Tunyasuvunakool, A. Muldal, Y. Doron, S. Liu, S. Bohez, J. Merel, T. Erez, T. P. Lillicrap, and N. Heess. dm_control: Software and tasks for continuous control. arXiv preprint arXiv:2006.12983, 2020.
- E. Todorov, T. Erez, and Y. Tassa. Mujoco: A physics engine for model-based control. In 2012 IEEE/RSJ International Conference on Intelligent Robots and Systems, pages 5026–5033. IEEE, 2012.
- H. Van Hasselt, A. Guez, and D. Silver. Deep reinforcement learning with double q-learning. In Thirtieth AAAI Conference on Artificial Intelligence, 2016.
- A. Veit, T. Matera, L. Neumann, J. Matas, and S. Belongie. Coco-text: Dataset and benchmark for text detection and recognition in natural images. arXiv preprint arXiv:1601.07140, 2016.
- O. Vinyals, I. Babuschkin, W. M. Czarnecki, M. Mathieu, A. Dudzik, J. Chung, D. H. Choi, R. Powell, T. Ewalds, P. Georgiev, et al. Grandmaster level in starcraft ii using multi-agent reinforcement learning. Nature, 575(7782):350–354, 2019.
- C. Voloshin, H. M. Le, N. Jiang, and Y. Yue. Empirical study of off-policy policy evaluation for reinforcement learning. arXiv preprint arXiv:1911.06854, 2019.
- Y. Wu, G. Tucker, and O. Nachum. Behavior regularized offline reinforcement learning. arXiv preprint arXiv:1911.11361, 2019.