Making Efficient Use of Demonstrations to Solve Hard Exploration Problems

ICLR, 2020.

Keywords: imitation learning, deep learning, reinforcement learning

Abstract:

This paper introduces R2D3, an agent that makes efficient use of demonstrations to solve hard exploration problems in partially observable environments with highly variable initial conditions. We also introduce a suite of eight tasks that combine these three properties, and show that R2D3 can solve several of the tasks where other state of the art methods fail to see even a single successful trajectory after tens of billions of steps of exploration.

Introduction
  • Reinforcement learning from demonstrations has proven to be an effective strategy for attacking problems that require sample efficiency and involve hard exploration.
  • We attack the problem of learning from demonstrations in hard exploration tasks in partially observable environments with highly variable initial conditions.
  • These three aspects (sparse rewards, partial observability, and highly variable initial conditions) together conspire to make learning challenging.
  • Our setting combines all of these, so we leave extending GAIL to this combined setting for future work
Highlights
  • Reinforcement learning from demonstrations has proven to be an effective strategy for attacking problems that require sample efficiency and involve hard exploration
  • Sparse rewards induce a difficult exploration problem, which is a challenge for many state of the art RL methods
  • GAIL (Ho and Ermon, 2016) is another imitation learning method, but standard GAIL does not work 1) in POMDPs (Gangwani et al., 2019; Zołna et al., 2019), 2) from pixels (Li et al., 2017; Reed et al., 2018), 3) off-policy (Kostrikov et al., 2018), or 4) with variable initial conditions (Zolna et al., 2019)
  • The task here is to push a particular block onto a sensor to give access to a large apple, and we examine the behavior of both R2D3 and R2D2 after 5B steps, which is long before R2D3 begins to solve the task with any regularity
  • We introduced the R2D3 agent, which is designed to make efficient use of demonstrations to learn in partially observable environments with sparse rewards and highly variable initial conditions
  • We showed through several experiments on eight very difficult tasks that our approach is able to outperform multiple state of the art baselines, two of which are themselves ablations of R2D3
Methods
  • We evaluate the performance of our R2D3 agent alongside state-of-the-art deep RL baselines.
  • As discussed in Section 5, we compare our R2D3 agent to behavioral cloning (BC), R2D2, and DQfD (the state of the art in learning from demonstrations).
  • For each task in the Hard-Eight suite, we trained R2D3, R2D2, and DQfD using 256 ε-greedy CPU-based actors and a single GPU-based learner process.
  • For R2D3 and DQfD the demo ratio was varied to study its effect (a minimal sketch of demo-ratio batching follows this list).
  • For BC we varied the learning rate independently, but no setting produced a successful agent
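
To make the demo ratio concrete: it is the probability with which each element of a training batch is drawn from the demonstration replay buffer rather than from the agent's own replay buffer. A minimal Python sketch of this batching scheme is shown below; the buffer interface and the default batch size are illustrative assumptions, not the authors' implementation.

    import random

    def sample_training_batch(demo_buffer, agent_buffer, demo_ratio, batch_size=32):
        """Assemble one training batch mixing demonstrations and agent experience.

        Each batch slot is filled from the demonstration replay buffer with
        probability demo_ratio, otherwise from the agent replay buffer. Both
        buffers are assumed to expose a sample() method that returns one
        replayed sequence; this is a sketch, not the R2D3 implementation.
        """
        batch = []
        for _ in range(batch_size):
            source = demo_buffer if random.random() < demo_ratio else agent_buffer
            batch.append(source.sample())
        return batch

Because the experiments sweep the demo ratio rather than fixing it, it is passed in as a parameter here.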
Results
  • An alternative approximation would be to store stale recurrent states in replay, but we did not find this to improve performance over zero initialization with burn-in (see the sketch after this list).
  • We showed through several experiments on eight very difficult tasks that our approach is able to outperform multiple state of the art baselines, two of which are themselves ablations of R2D3
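
The zero-initialization-with-burn-in scheme mentioned in the first bullet follows the R2D2 recipe (Kapturowski et al., 2018): the recurrent state for a replayed sequence is reset to zeros, and the first few steps are used only to warm the state up before any loss is computed. A rough sketch is given below, assuming a q_network(obs, state) callable that returns (q_values, next_state); the interface and the default burn-in length are illustrative assumptions.

    def unroll_with_burn_in(q_network, zero_state, sequence, burn_in=40):
        """Unroll a recurrent Q-network over a replayed sequence of observations.

        The recurrent state starts from zeros instead of the (stale) state that
        was current when the sequence was stored. The first burn_in steps only
        advance the state; their outputs are discarded and contribute no loss.
        """
        state = zero_state
        for obs in sequence[:burn_in]:        # burn-in: warm up the state only
            _, state = q_network(obs, state)
        q_outputs = []
        for obs in sequence[burn_in:]:        # training steps: keep Q-values for the loss
            q_values, state = q_network(obs, state)
            q_outputs.append(q_values)
        return q_outputs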
Conclusion
  • We introduced the R2D3 agent, which is designed to make efficient use of demonstrations to learn in partially observable environments with sparse rewards and highly variable initial conditions.
  • We introduced the Hard-Eight suite of tasks and used them in all of our experiments
  • These tasks are designed to be partially observable tasks with sparse rewards and highly variable initial conditions, making them an ideal testbed for showcasing the strengths of R2D3 in contrast to existing methods in the literature
Tables
  • Table1: Human demonstration statistics. We collected 100 demos for each task from three human demonstrators. We report mean lengths (in number of frames) and rewards of the episodes along with the standard deviations for each task (a small computation sketch follows this list)
  • Table2: Hyper-parameters used for all experiments
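
As a small illustration of how the Table 1 statistics are reported, the sketch below computes per-task means and standard deviations of demonstration lengths and rewards; the episode record keys are hypothetical.

    import numpy as np

    def summarize_demos(episodes):
        """Summarize demonstration episodes as in Table 1 (mean and std).

        episodes is assumed to be a list of dicts with hypothetical keys
        'length' (episode length in frames) and 'reward' (episode return).
        """
        lengths = np.array([ep["length"] for ep in episodes], dtype=float)
        rewards = np.array([ep["reward"] for ep in episodes], dtype=float)
        return {
            "length_mean": lengths.mean(), "length_std": lengths.std(),
            "reward_mean": rewards.mean(), "reward_std": rewards.std(),
        }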
Funding
  • Introduces a suite of eight tasks that combine these three properties, and shows that R2D3 can solve several of the tasks where other state of the art methods fail to see even a single successful trajectory after tens of billions of steps of exploration
  • Identifies a key parameter of our algorithm, the demo-ratio, which controls the proportion of expert demonstrations vs agent experience in each training batch
  • Introduces a suite of tasks that exhibit our three targeted properties
  • Proposes a new agent, referred to as Recurrent Replay Distributed DQN from Demonstrations (R2D3)
  • Finds that storing stale recurrent states in replay does not improve performance over zero initialization with burn-in
Reference
  • Martin Abadi, Paul Barham, Jianmin Chen, Zhifeng Chen, Andy Davis, Jeffrey Dean, Matthieu Devin, Sanjay Ghemawat, Geoffrey Irving, Michael Isard, Manjunath Kudlur, Josh Levenberg, Rajat Monga, Sherry Moore, Derek G. Murray, Benoit Steiner, Paul Tucker, Vijay Vasudevan, Pete Warden, Martin Wicke, Yuan Yu, and Xiaoqiang Zheng. Tensorflow: A system for large-scale machine learning. In 12th USENIX Symposium on Operating Systems Design and Implementation, pages 265–283, 2016.
  • Yusuf Aytar, Tobias Pfaff, David Budden, Thomas Paine, Ziyu Wang, and Nando de Freitas. Playing hard exploration games by watching YouTube. In Advances in Neural Information Processing Systems, pages 2930–2941, 2018.
  • Marc G Bellemare, Joel Veness, and Michael Bowling. Investigating contingency awareness using Atari 2600 games. In AAAI Conference on Artificial Intelligence, pages 864–871, 2012.
  • Marc G Bellemare, Yavar Naddaf, Joel Veness, and Michael Bowling. The arcade learning environment: An evaluation platform for general agents. Journal of Artificial Intelligence Research, 47:253–279, 2013.
  • Marc G Bellemare, Sriram Srinivasan, Georg Ostrovski, Tom Schaul, David Saxton, and Remi Munos. Unifying count-based exploration and intrinsic motivation. In Advances in Neural Information Processing Systems, pages 1471–1479, 2016.
  • Yoshua Bengio, Jérôme Louradour, Ronan Collobert, and Jason Weston. Curriculum learning. In International Conference on Machine Learning, pages 41–48, 2009.
  • Nuttapong Chentanez, Andrew G Barto, and Satinder P Singh. Intrinsically motivated reinforcement learning. In Advances in neural information processing systems, pages 1281–1288, 2005.
  • Adrien Ecoffet, Joost Huizinga, Joel Lehman, Kenneth O Stanley, and Jeff Clune. Go-explore: a new approach for hard-exploration problems. arXiv preprint arXiv:1901.10995, 2019.
  • Scott Fujimoto, David Meger, and Doina Precup. Off-policy deep reinforcement learning without exploration. arXiv preprint arXiv:1812.02900, 2018.
  • Tanmay Gangwani, Joel Lehman, Qiang Liu, and Jian Peng. Learning belief representations for imitation learning in pomdps. arXiv preprint arXiv:1906.09510, 2019.
  • Dibya Ghosh, Avi Singh, Aravind Rajeswaran, Vikash Kumar, and Sergey Levine. Divide-and-conquer reinforcement learning. arXiv preprint arXiv:1711.09874, 2017.
  • Alex Graves, Marc G Bellemare, Jacob Menick, Remi Munos, and Koray Kavukcuoglu. Automated curriculum learning for neural networks. In International Conference on Machine Learning, pages 1311–1320, 2017.
  • Matteo Hessel, Joseph Modayil, Hado Van Hasselt, Tom Schaul, Georg Ostrovski, Will Dabney, Dan Horgan, Bilal Piot, Mohammad Azar, and David Silver. Rainbow: Combining improvements in deep reinforcement learning. In AAAI Conference on Artificial Intelligence, pages 3215–3222, 2018.
  • Todd Hester, Matej Vecerik, Olivier Pietquin, Marc Lanctot, Tom Schaul, Bilal Piot, Dan Horgan, John Quan, Andrew Sendonaris, Ian Osband, John Agapiou, Joel Z. Leibo, and Audrunas Gruslys. Deep Q-learning from demonstrations. In AAAI Conference on Artificial Intelligence, pages 3223–3230, 2018.
  • Jonathan Ho and Stefano Ermon. Generative adversarial imitation learning. In Advances in Neural Information Processing Systems, pages 4565–4573, 2016.
  • Dan Horgan, John Quan, David Budden, Gabriel Barth-Maron, Matteo Hessel, Hado van Hasselt, and David Silver. Distributed prioritized experience replay. In International Conference on Learning Representations, 2018.
  • John D Hunter. Matplotlib: A 2D graphics environment. Computing in science & engineering, 9(3):90–95, 2007.
  • Max Jaderberg, Volodymyr Mnih, Wojciech Marian Czarnecki, Tom Schaul, Joel Z Leibo, David Silver, and Koray Kavukcuoglu. Reinforcement learning with unsupervised auxiliary tasks. arXiv preprint arXiv:1611.05397, 2016.
  • Arthur Juliani. On “solving” Montezuma’s revenge. https://medium.com/@awjuliani/on-solving-montezumas-revenge-2146d83f0bc3, 2018. Accessed:2019-19-21.
  • Arthur Juliani, Ahmed Khalifa, Vincent-Pierre Berges, Jonathan Harper, Hunter Henry, Adam Crespi, Julian Togelius, and Danny Lange. Obstacle tower: A generalization challenge in vision, control, and planning. In AAAI-19 Workshop on Games and Simulations for Artificial Intelligence, 2019.
  • Bingyi Kang, Zequn Jie, and Jiashi Feng. Policy optimization with demonstrations. In International Conference on Machine Learning, pages 2469–2478, 2018.
  • Steven Kapturowski, Georg Ostrovski, John Quan, Remi Munos, and Will Dabney. Recurrent experience replay in distributed reinforcement learning. In International Conference on Learning Representations, 2018.
  • Beomjoon Kim, Amir-massoud Farahmand, Joelle Pineau, and Doina Precup. Learning from limited demonstrations. In Advances in Neural Information Processing Systems, pages 2859–2867, 2013.
  • Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
  • Ilya Kostrikov, Kumar Krishna Agrawal, Debidatta Dwibedi, Sergey Levine, and Jonathan Tompson. Discriminator-actor-critic: Addressing sample inefficiency and reward bias in adversarial imitation learning. arXiv preprint arXiv:1809.02925, 2018.
  • Eric Langlois, Shunshi Zhang, Guodong Zhang, Pieter Abbeel, and Jimmy Ba. Benchmarking model-based reinforcement learning. arXiv preprint arXiv:1907.02057, 2019.
  • Yunzhu Li, Jiaming Song, and Stefano Ermon. Infogail: Interpretable imitation learning from visual demonstrations. In Advances in Neural Information Processing Systems, pages 3812–3822, 2017.
  • Marlos C Machado, Marc G Bellemare, Erik Talvitie, Joel Veness, Matthew Hausknecht, and Michael Bowling. Revisiting the arcade learning environment: Evaluation protocols and open problems for general agents. Journal of Artificial Intelligence Research, 61:523–562, 2018.
  • Wes McKinney et al. Data structures for statistical computing in python. In Proceedings of the 9th Python in Science Conference, pages 51–56, 2010.
  • Josh Merel, Yuval Tassa, Sriram Srinivasan, Jay Lemmon, Ziyu Wang, Greg Wayne, and Nicolas Heess. Learning human behaviors from motion capture by adversarial imitation. arXiv preprint arXiv:1707.02201, 2017.
  • Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Andrei A Rusu, Joel Veness, Marc G Bellemare, Alex Graves, Martin Riedmiller, Andreas K Fidjeland, Georg Ostrovski, et al. Human-level control through deep reinforcement learning. Nature, 518(7540):529–533, 2015.
  • Ashvin Nair, Bob McGrew, Marcin Andrychowicz, Wojciech Zaremba, and Pieter Abbeel. Overcoming exploration in reinforcement learning with demonstrations. In IEEE International Conference on Robotics and Automation, pages 6292–6299, 2018.
  • Andrew Y Ng, Daishi Harada, and Stuart Russell. Policy invariance under reward transformations: Theory and application to reward shaping. In International Conference on Machine Learning, pages 278–287, 1999.
  • Travis Oliphant. Guide to NumPy. USA: Trelgol Publishing, 2006.
  • Ian Osband, John Aslanides, and Albin Cassirer. Randomized prior functions for deep reinforcement learning. In Advances in Neural Information Processing Systems, pages 8617–8629, 2018.
  • Tom Le Paine, Sergio Gómez Colmenarejo, Ziyu Wang, Scott Reed, Yusuf Aytar, Tobias Pfaff, Matt W Hoffman, Gabriel Barth-Maron, Serkan Cabi, David Budden, et al. One-shot high-fidelity imitation: Training large-scale deep nets with RL. arXiv preprint arXiv:1810.05017, 2018.
  • Xue Bin Peng, Pieter Abbeel, Sergey Levine, and Michiel van de Panne. Deepmimic: Example-guided deep reinforcement learning of physics-based character skills. ACM Transactions on Graphics, 37(4):1:14, 2018.
  • Tobias Pohlen, Bilal Piot, Todd Hester, Mohammad Gheshlaghi Azar, Dan Horgan, David Budden, Gabriel Barth-Maron, Hado van Hasselt, John Quan, Mel Vecerík, et al. Observe and look further: Achieving consistent performance on atari. arXiv preprint arXiv:1805.11593, 2018.
  • Dean A Pomerleau. Alvinn: An autonomous land vehicle in a neural network. In Advances in neural information processing systems, pages 305–313, 1989.
  • Rouhollah Rahmatizadeh, Pooya Abolghasemi, Ladislau Bölöni, and Sergey Levine. Vision-based multi-task manipulation for inexpensive robots using end-to-end learning from demonstration. In IEEE International Conference on Robotics and Automation, pages 3758–3765, 2018.
  • Scott Reed, Yusuf Aytar, Ziyu Wang, Tom Paine, Aäron van den Oord, Tobias Pfaff, Sergio Gomez, Alexander Novikov, David Budden, and Oriol Vinyals. Visual imitation with a minimal adversary. 2018.
  • Tim Salimans and Richard Chen. Learning Montezuma’s revenge from a single demonstration. https://openai.com/blog/learning-montezumas-revenge-from-a-single-demonstration, 2018a. Accessed: 2019-19-22.
  • Tim Salimans and Richard Chen. Learning Montezuma’s revenge from a single demonstration. arXiv preprint arXiv:1812.03381, 2018b.
  • Tom Schaul, John Quan, Ioannis Antonoglou, and David Silver. Prioritized experience replay. In International Conference on Learning Representations, 2016.
  • Jürgen Schmidhuber. Curious model-building control systems. In IEEE International Joint Conference on Neural Networks, pages 1458–1463, 1991.
  • Adrien Ali Taïga, William Fedus, Marlos C Machado, Aaron Courville, and Marc G Bellemare. Benchmarking bonus-based exploration methods on the arcade learning environment. arXiv preprint arXiv:1908.02388, 2019.
  • Matej Vecerík, Todd Hester, Jonathan Scholz, Fumin Wang, Olivier Pietquin, Bilal Piot, Nicolas Heess, Thomas Rothörl, Thomas Lampe, and Martin Riedmiller. Leveraging demonstrations for deep reinforcement learning on robotics problems with sparse rewards. arXiv preprint arXiv:1707.08817, 2017.
  • Ziyu Wang, Tom Schaul, Matteo Hessel, Hado Hasselt, Marc Lanctot, and Nando Freitas. Dueling network architectures for deep reinforcement learning. In International Conference on Machine Learning, pages 1995–2003, 2016.
  • Konrad Zolna, Scott Reed, Alexander Novikov, Sergio Gomez Colmenarej, David Budden, Serkan Cabi, Misha Denil, Nando de Freitas, and Ziyu Wang. Task-relevant adversarial imitation learning. arXiv preprint arXiv:1910.01077, 2019.
  • Konrad Zołna, Negar Rostamzadeh, Yoshua Bengio, Sungjin Ahn, and Pedro O Pinheiro. Reinforced imitation in heterogeneous action space. arXiv preprint arXiv:1904.03438, 2019.
  • Highly Variable Initial Conditions: Many of the elements of the tasks are procedurally generated, which leads to significant variability between episodes of the same task. In particular, the starting position and orientation of the agent are randomized, and similarly, where they are present, the shapes, colors, and textures of various objects are randomly sampled from a set of available features. Therefore, unlike for DQfD on Atari (Pohlen et al., 2018), a single demonstration (or a small number of them) is not sufficient to guide an agent to solve the task.
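
To make the per-episode procedural randomization described above concrete, here is a minimal sketch of that kind of episode setup; the feature pools, arena size, and record layout are purely illustrative assumptions, not the tasks' actual generator.

    import random

    # Illustrative pools of procedural features; the real tasks sample from
    # their own sets of shapes, colors, and textures.
    SHAPES = ["cube", "sphere", "pyramid"]
    COLORS = ["red", "green", "blue", "yellow"]
    TEXTURES = ["smooth", "striped", "checkered"]

    def sample_initial_conditions(num_objects=3, arena_size=10.0):
        """Randomize the agent pose and object appearance for one episode."""
        agent = {
            "position": (random.uniform(0.0, arena_size), random.uniform(0.0, arena_size)),
            "orientation_deg": random.uniform(0.0, 360.0),
        }
        objects = [
            {
                "shape": random.choice(SHAPES),
                "color": random.choice(COLORS),
                "texture": random.choice(TEXTURES),
                "position": (random.uniform(0.0, arena_size), random.uniform(0.0, arena_size)),
            }
            for _ in range(num_objects)
        ]
        return {"agent": agent, "objects": objects}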
  • This section gives additional details on each task in our suite, including a sequence of frames from a successful task execution (performed by a human) and a list of the procedural elements randomized per episode. Videos of agents and humans performing these tasks can be found at https://bit.ly/2mAAUgg.
  • Table 2 (hyper-parameter values only; labels not recoverable): 2e-4, Adam (Kingma and Ba, 2014), True, 0.997, 32, 400, 200, True, 80, 40, True, 256, 500000, 25000