Scaling data-driven robotics with reward sketching and batch reinforcement learning
Robotics: Science and Systems, 2020.
Abstract:
By harnessing a growing dataset of robot experience, we learn control policies for a diverse and increasing set of related manipulation tasks. To make this possible, we introduce reward sketching: an effective way of eliciting human preferences to learn the reward function for a new task. This reward function is then used to retrospectively annotate all historical data, stored in NeverEnding Storage (NES), with reward signals for the new task, so that batch RL agents can be trained on the full dataset.
Data: https://github.com/deepmind/deepmind-research/tree/master/sketchy
Introduction
- Deep learning has successfully advanced many areas of artificial intelligence, including vision [39, 26], speech recognition [24, 46, 4], natural language processing [17], and reinforcement learning (RL) [49, 63].
- The success of deep learning in each of these fields was made possible by the availability of huge amounts of labeled training data.
- In simulated environments like video games, where experience and rewards are easy to obtain, deep RL has been tremendously successful, outperforming highly skilled humans by ingesting huge amounts of data [63, 69, 9].
- The lack of large datasets with reward signals has limited the effectiveness of deep RL in robotics
Highlights
- Deep learning has successfully advanced many areas of artificial intelligence, including vision [39, 26], speech recognition [24, 46, 4], natural language processing [17], and reinforcement learning (RL) [49, 63]
- Evaluation: While the reward and policy are learned from data, we cannot assess their ultimate quality without running the agent on the real robot
- As the agent is learned off-line, good performance on the real robot is a powerful indicator of generalization
- Its key components include a method for reward learning, retrospective reward labelling and batch reinforcement learning with distributional value functions
- We found that reward sketching is an effective way to elicit reward functions, since humans are good at judging progress toward a goal; a minimal supervised-training sketch of this idea follows this list
- Diversity of training data seems to be an essential factor in the success of standard state-of-the-art reinforcement learning algorithms, which were previously reported to fail when trained only on expert data or the history of a single agent [22]
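Reward sketching, as described above, reduces reward specification to a per-frame regression problem: an annotator drags a slider while watching a replayed episode, and a network is fit to the resulting progress curve. The snippet below is a minimal, hypothetical PyTorch sketch of that supervised step; the convolutional architecture, the 84x84 input size and the mean-squared-error loss are assumptions for illustration, not the paper's implementation (which may combine regression with ranking-style objectives derived from the sketches).

```python
import torch
import torch.nn as nn

class RewardModel(nn.Module):
    """Maps a single camera frame to a predicted reward (task progress) in [0, 1]."""

    def __init__(self):
        super().__init__()
        self.encoder = nn.Sequential(                  # small conv net; purely illustrative
            nn.Conv2d(3, 32, kernel_size=8, stride=4), nn.ReLU(),
            nn.Conv2d(32, 64, kernel_size=4, stride=2), nn.ReLU(),
            nn.Flatten(),
        )
        self.head = nn.LazyLinear(1)                   # infers the flattened feature size

    def forward(self, frames):                         # frames: (B, 3, 84, 84), floats in [0, 1]
        return torch.sigmoid(self.head(self.encoder(frames))).squeeze(-1)


def sketch_regression_step(model, optimiser, frames, sketched_rewards):
    """One supervised update on (frame, human-sketched reward) pairs."""
    optimiser.zero_grad()
    loss = nn.functional.mse_loss(model(frames), sketched_rewards)
    loss.backward()
    optimiser.step()
    return loss.item()
```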
Methods
- The general workflow is illustrated in Fig. 1 and a more detailed procedure is presented in Fig. 5.
- A task-specific reward model allows the authors to retrospectively annotate all data in NES with reward signals for a new task; a minimal relabelling sketch follows this list.
- The authors can train batch RL agents with all the data in NES.
- The procedure for training an agent to complete a new task has the following steps, which are described in turn in the remainder of the section:
- A human teleoperates the robot to provide first-person demonstrations of the target task
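Concretely, the retrospective annotation step amounts to running the learned reward model over every frame of every stored episode and attaching the predictions as reward signals before batch RL training. Below is a minimal, self-contained sketch of that relabelling pass under an assumed data layout (episodes as dicts of arrays); this is not the authors' NES interface, and the stand-in reward model just fakes a progress signal.

```python
from typing import Callable, List, Sequence
import numpy as np

def retrospective_relabel(
    episodes: Sequence[dict],
    reward_model: Callable[[np.ndarray], np.ndarray],
) -> List[dict]:
    """Annotate every stored episode with predicted per-frame rewards for a new task.

    Each episode is assumed to be a dict holding an 'observations' array of shape
    (T, H, W, 3); the learned reward model maps those T frames to T scalar rewards.
    """
    relabelled = []
    for ep in episodes:
        rewards = reward_model(ep["observations"])     # predicted, not logged, rewards
        relabelled.append({**ep, "rewards": np.asarray(rewards, dtype=np.float32)})
    return relabelled

if __name__ == "__main__":
    # Stand-in for a learned reward model: pretends progress grows linearly over time.
    fake_model = lambda frames: np.linspace(0.0, 1.0, len(frames))
    episode = {"observations": np.zeros((10, 64, 64, 3)), "actions": np.zeros((10, 4))}
    print(retrospective_relabel([episode], fake_model)[0]["rewards"])
```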
Results
- Existing RL approaches for real-world robotics mainly focus on tasks where hand-crafted reward mechanisms can be developed.
- Simple behaviours such as learning to grasp objects [35] or learning to fly [23] by avoiding crashing can be acquired by reward engineering.
- As the agent is learned off-line, good performance on the real robot is a powerful indicator of generalization
- To this end, the authors conducted controlled evaluations on the physical robot with fixed initial conditions across different policies; an illustrative success-rate tally follows this list.
- The hard and unseen conditions are especially challenging, since they require the agent to cope with novel objects and novel object configurations
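Since the policies are trained entirely off-line, the only meaningful quality measure is rolling them out on the robot under matched initial conditions and counting successes per condition, which is what Table 2 reports. The toy tally below illustrates that kind of bookkeeping; the condition names and outcomes are made up for the example and are not the paper's numbers.

```python
from collections import defaultdict

def success_rates(results):
    """Aggregate per-condition success rates from (condition, success) evaluation episodes."""
    counts = defaultdict(lambda: [0, 0])               # condition -> [successes, episodes]
    for condition, success in results:
        counts[condition][0] += int(success)
        counts[condition][1] += 1
    return {c: s / n for c, (s, n) in counts.items()}

# Illustrative roll-out outcomes for one policy under easy / hard / unseen splits.
episodes = [("easy", True), ("easy", True), ("hard", True), ("hard", False), ("unseen", False)]
print(success_rates(episodes))                         # {'easy': 1.0, 'hard': 0.5, 'unseen': 0.0}
```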
Conclusion
- The authors have proposed a new data-driven approach to robotics.
- Its key components include a method for reward learning, retrospective reward labelling and batch RL with distributional value functions.
- To further advance data-driven robotics, reward learning and batch RL, the authors release the large datasets [16] from NeverEnding Storage and canonical agents [28].
- The authors' results across a wide set of tasks illustrate the versatility of the data-driven approach.
- The learned agents showed a significant degree of generalization and robustness
Tables
- Table 1: Dataset statistics. Total includes off-task data not listed in individual rows; the teleoperation data and the tasks lift_green, stack_green_on_red, and lift_cloth partly overlap
- Table 2: The success rate of our agent and ablations for a given task in different difficulty settings. Recall that our agent is trained off-line
Related work
- RL has a long history in robotics [37, 53, 34, 25, 42, 43, 35]. However, applying RL in this domain inherits all the general difficulties of applying RL in the real world [18]. Most published works either rely on state estimation for a specific task or operate in a very limited regime when learning from raw observations. These methods typically entail highly engineered reward functions. In our work, we go beyond the usual scale of RL applications in robotics, learning from raw observations and without predefined rewards.
Batch RL trains policies from a fixed dataset and is therefore particularly useful in real-world applications like robotics. It is currently an active area of research (see Lange et al. [41] for an overview), with a number of recent works aimed at improving stability [22, 30, 1, 40].
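The defining property of batch RL is that the learner only ever samples from a fixed log of transitions and never interacts with the environment during training. The toy sketch below makes that concrete with tabular Q-iteration over a hand-written transition log; it is a didactic stand-in, not the distributional actor-critic agents used in the paper.

```python
import numpy as np

def batch_q_iteration(transitions, n_states, n_actions, gamma=0.9, sweeps=50, lr=0.1):
    """Tabular Q-learning driven purely by a fixed list of logged transitions.

    transitions: (state, action, reward, next_state, done) tuples collected beforehand;
    no further environment interaction happens during training.
    """
    q = np.zeros((n_states, n_actions))
    for _ in range(sweeps):
        for s, a, r, s2, done in transitions:          # replay the same fixed batch
            target = r if done else r + gamma * q[s2].max()
            q[s, a] += lr * (target - q[s, a])
    return q

# Tiny two-state example: only action 1 in state 0 reaches the rewarding terminal state.
logged = [(0, 0, 0.0, 0, False), (0, 1, 1.0, 1, True), (0, 1, 1.0, 1, True)]
print(batch_q_iteration(logged, n_states=2, n_actions=2).round(2))
```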
References
- Rishabh Agarwal, Dale Schuurmans, and Mohammad Norouzi. Striving for simplicity in off-policy deep reinforcement learning. arXiv preprint arXiv:1907.04543, 2019.
- Riad Akrour, Marc Schoenauer, and Michèle Sebag. APRIL: Active preference learning-based reinforcement learning. In ECMLPKDD, pages 116–131, 2012.
- Riad Akrour, Marc Schoenauer, Michèle Sebag, and Jean-Christophe Souplet. Programming by feedback. In International Conference on Machine Learning, pages 1503–1511, 2014.
- Dario Amodei, Sundaram Ananthanarayanan, Rishita Anubhai, Jingliang Bai, Eric Battenberg, Carl Case, Jared Casper, Bryan Catanzaro, Qiang Cheng, Guoliang Chen, et al. Deep speech 2: End-to-end speech recognition in English and Mandarin. In International Conference on Machine Learning, pages 173–182, 2016.
- Jimmy Lei Ba, Jamie Ryan Kiros, and Geoffrey E Hinton. Layer normalization. arXiv preprint arXiv:1607.06450, 2016.
- Nir Baram, Oron Anschel, Itai Caspi, and Shie Mannor. End-to-end differentiable adversarial imitation learning. In International Conference on Machine Learning, pages 390–399, 2017.
- Gabriel Barth-Maron, Matthew W. Hoffman, David Budden, Will Dabney, Dan Horgan, Dhruva TB, Alistair Muldal, Nicolas Heess, and Timothy Lillicrap. Distributed distributional deterministic policy gradients. In International Conference on Learning Representations, 2018.
- Marc G Bellemare, Will Dabney, and Rémi Munos. A distributional perspective on reinforcement learning. In International Conference on Machine Learning, pages 449–458, 2017.
- Christopher Berner, Greg Brockman, Brooke Chan, Vicki Cheung, Przemysław Debiak, Christy Dennison, David Farhi, Quirin Fischer, Shariq Hashme, Chris Hesse, et al. Dota 2 with large scale deep reinforcement learning. arXiv preprint arXiv:1912.06680, 2019.
- Eric Brochu, Nando de Freitas, and Abhijeet Ghosh. Active preference learning with discrete choice data. In Advances on Neural Information Processing Systems, pages 409–416, 2007.
- Eric Brochu, Tyson Brochu, and Nando de Freitas. A Bayesian interactive optimization approach to procedural animation design. In SIGGRAPH Symposium on Computer Animation, pages 103–112, 2010.
- Daniel Brown, Wonjoon Goo, Prabhat Nagarajan, and Scott Niekum. Extrapolating beyond suboptimal demonstrations via inverse reinforcement learning from observations. In International Conference on Machine Learning, pages 783–792, 2019.
- Paul F Christiano, Jan Leike, Tom Brown, Miljan Martic, Shane Legg, and Dario Amodei. Deep reinforcement learning from human preferences. In Advances on Neural Information Processing Systems, pages 4299–4307, 2017.
- Wei Chu and Zoubin Ghahramani. Preference learning with Gaussian processes. In International Conference on Machine Learning, pages 137–144, 2005.
- Sudeep Dasari, Frederik Ebert, Stephen Tian, Suraj Nair, Bernadette Bucher, Karl Schmeckpeper, Siddharth Singh, Sergey Levine, and Chelsea Finn. RoboNet: Large-scale multi-robot learning. In Conference on Robot Learning, 2019.
- DeepMind. Sketchy data, 2020. URL https://github.com/deepmind/deepmind-research/tree/master/sketchy.
- Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1, pages 4171–4186, 2019.
- Gabriel Dulac-Arnold, Daniel J. Mankowitz, and Todd Hester. Challenges of real-world reinforcement learning. arXiv preprint arXiv:1904.12901, 2019.
- Stephen E. Fienberg and Kinley Larntz. Log-linear representation for paired and multiple comparison models. Biometrika, 63(2):245–254, 1976.
- Chelsea Finn, Sergey Levine, and Pieter Abbeel. Guided cost learning: Deep inverse optimal control via policy optimization. In International Conference on Machine Learning, pages 49–58, 2016.
- Justin Fu, Katie Luo, and Sergey Levine. Learning robust rewards with adversarial inverse reinforcement learning. In International Conference on Learning Representations, 2018.
- Scott Fujimoto, David Meger, and Doina Precup. Off-policy deep reinforcement learning without exploration. arXiv e-prints, art. arXiv:1812.02900, 2018.
- Dhiraj Gandhi, Lerrel Pinto, and Abhinav Gupta. Learning to fly by crashing. In International Conference on Intelligent Robots and Systems, pages 3948–3955, 2017.
- Alex Graves and Navdeep Jaitly. Towards end-to-end speech recognition with recurrent neural networks. In International Conference on Machine Learning, pages 1764–1772, 2014.
- Roland Hafner and Martin Riedmiller. Reinforcement learning in feedback control. Machine learning, 84(1-2):137–169, 2011.
- Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In IEEE Computer Vision and Pattern Recognition, pages 770–778, 2016.
- Jonathan Ho and Stefano Ermon. Generative adversarial imitation learning. In Advances on Neural Information Processing Systems, pages 4565–4573, 2016.
- Matt Hoffman, Bobak Shahriari, John Aslanides, Gabriel Barth-Maron, Feryal Behbahani, Tamara Norman, Abbas Abdolmaleki, Albin Cassirer, Fan Yang, Kate Baumli, Sarah Henderson, Alex Novikov, Sergio Gómez Colmenarejo, Serkan Cabi, Caglar Gulcehre, Tom Le Paine, Andrew Cowie, Ziyu Wang, Bilal Piot, and Nando de Freitas. Acme: A research framework for distributed reinforcement learning. arXiv preprint arXiv:2006.00979, 2020.
- Borja Ibarz, Jan Leike, Tobias Pohlen, Geoffrey Irving, Shane Legg, and Dario Amodei. Reward learning from human preferences and demonstrations in Atari. In Advances on Neural Information Processing Systems, pages 8011–8023, 2018.
- Natasha Jaques, Asma Ghandeharioun, Judy Hanwen Shen, Craig Ferguson, Agata Lapedriza, Noah Jones, Shixiang Gu, and Rosalind Picard. Way off-policy batch deep reinforcement learning of implicit human preferences in dialog. arXiv preprint arXiv:1907.00456, 2019.
- Rae Jeong, Yusuf Aytar, David Khosid, Yuxiang Zhou, Jackie Kay, Thomas Lampe, Konstantinos Bousmalis, and Francesco Nori. Self-supervised sim-to-real adaptation for visual robotic manipulation. arXiv preprint arXiv:1910.09470, 2019.
- Thorsten Joachims, Laura Granka, Bing Pan, Helene Hembrooke, Filip Radlinski, and Geri Gay. Evaluating the accuracy of implicit feedback from clicks and query reformulations in web search. ACM Transactions on Information Systems, 25(2), 2007.
- Justin Johnson, Bharath Hariharan, Laurens van der Maaten, Li Fei-Fei, C Lawrence Zitnick, and Ross Girshick. CLEVR: A diagnostic dataset for compositional language and elementary visual reasoning. In IEEE Computer Vision and Pattern Recognition, 2017.
- Mrinal Kalakrishnan, Ludovic Righetti, Peter Pastor, and Stefan Schaal. Learning force control policies for compliant manipulation. In International Conference on Intelligent Robots and Systems, pages 4639–4644, 2011.
- Dmitry Kalashnikov, Alex Irpan, Peter Pastor, Julian Ibarz, Alexander Herzog, Eric Jang, Deirdre Quillen, Ethan Holly, Mrinal Kalakrishnan, Vincent Vanhoucke, and Sergey Levine. Scalable deep reinforcement learning for vision-based robotic manipulation. In Conference on Robot Learning, pages 651–673, 2018.
- Steven Kapturowski, Georg Ostrovski, John Quan, Remi Munos, and Will Dabney. Recurrent experience replay in distributed reinforcement learning. In International Conference on Learning Representations, 2018.
- Jens Kober, J Andrew Bagnell, and Jan Peters. Reinforcement learning in robotics: A survey. The International Journal of Robotics Research, 32(11):1238–1274, 2013.
- Yuki Koyama, Issei Sato, Daisuke Sakamoto, and Takeo Igarashi. Sequential line search for efficient visual design optimization by crowds. ACM Transactions on Graphics, 36(4):1–11, 2017.
- Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. Imagenet classification with deep convolutional neural networks. In Advances on Neural Information Processing Systems, pages 1097–1105, 2012.
- Aviral Kumar, Justin Fu, Matthew Soh, George Tucker, and Sergey Levine. Stabilizing off-policy Q-learning via bootstrapping error reduction. In Advances on Neural Information Processing Systems, pages 11761–11771, 2019.
- Sascha Lange, Thomas Gabel, and Martin Riedmiller. Batch reinforcement learning. In Reinforcement Learning, pages 45–73. Springer, 2012.
- Sergey Levine, Chelsea Finn, Trevor Darrell, and Pieter Abbeel. End-to-end training of deep visuomotor policies. The Journal of Machine Learning Research, 17(1):1334–1373, 2016.
- Sergey Levine, Peter Pastor, Alex Krizhevsky, Julian Ibarz, and Deirdre Quillen. Learning hand-eye coordination for robotic grasping with deep learning and large-scale data collection. The International Journal of Robotics Research, 37(4-5):421–436, 2018.
- Yunzhu Li, Jiaming Song, and Stefano Ermon. Infogail: Interpretable imitation learning from visual demonstrations. In Advances on Neural Information Processing Systems, pages 3812–3822, 2017.
- Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C Lawrence Zitnick. Microsoft COCO: Common objects in context. In European Conference on Computer Vision, pages 740–755, 2014.
- Andrew Maas, Ziang Xie, Dan Jurafsky, and Andrew Ng. Lexicon-free conversational speech recognition with neural networks. In Proceedings of the 2015 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 345–354, 2015.
- Ajay Mandlekar, Yuke Zhu, Animesh Garg, Jonathan Booher, Max Spero, Albert Tung, Julian Gao, John Emmons, Anchit Gupta, Emre Orbay, Silvio Savarese, and Li Fei-Fei. RoboTurk: A crowdsourcing platform for robotic skill learning through imitation. In Conference on Robot Learning, pages 879–893, 2018.
- Josh Merel, Yuval Tassa, Sriram Srinivasan, Jay Lemmon, Ziyu Wang, Greg Wayne, and Nicolas Heess. Learning human behaviors from motion capture by adversarial imitation. arXiv preprint arXiv:1707.02201, 2017.
- Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Andrei A. Rusu, Joel Veness, Marc G. Bellemare, Alex Graves, Martin Riedmiller, Andreas K. Fidjeland, and Georg Ostrovski et al. Human-level control through deep reinforcement learning. Nature, 518(7540):529–533, 2015.
- F Mosteller. Remarks on the method of paired comparisons: I. the least squares solution assuming equal standard deviations and equal correlations. Psychometrika, 16:3–9, 1951.
- Ashvin Nair, Bob McGrew, Marcin Andrychowicz, Wojciech Zaremba, and Pieter Abbeel. Overcoming exploration in reinforcement learning with demonstrations. In IEEE International Conference on Robotics & Automation, pages 6292–6299, 2018.
- Andrew Y. Ng and Stuart J. Russell. Algorithms for inverse reinforcement learning. In International Conference on Machine Learning, pages 663–670, 2000.
- Jan Peters and Stefan Schaal. Reinforcement learning of motor skills with policy gradients. Neural networks, 21(4):682–697, 2008.
- Tobias Pohlen, Bilal Piot, Todd Hester, Mohammad Gheshlaghi Azar, Dan Horgan, David Budden, Gabriel Barth-Maron, Hado van Hasselt, John Quan, Mel Vecerík, et al. Observe and look further: Achieving consistent performance on Atari. arXiv preprint arXiv:1805.11593, 2018.
- Dean A Pomerleau. Alvinn: An autonomous land vehicle in a neural network. In Advances on Neural Information Processing Systems, pages 305–313, 1989.
- Rouhollah Rahmatizadeh, Pooya Abolghasemi, Ladislau Bölöni, and Sergey Levine. Vision-based multi-task manipulation for inexpensive robots using end-to-end learning from demonstration. In IEEE International Conference on Robotics & Automation, pages 3758–3765, 2018.
- Aravind Rajeswaran, Vikash Kumar, Abhishek Gupta, Giulia Vezzani, John Schulman, Emanuel Todorov, and Sergey Levine. Learning complex dexterous manipulation with deep reinforcement learning and demonstrations. Robotics, Science and Systems, 2018.
- Olga Russakovsky, Jia Deng, Hao Su, Jonathan Krause, Sanjeev Satheesh, Sean Ma, Zhiheng Huang, Andrej Karpathy, Aditya Khosla, Michael Bernstein, Alexander C. Berg, and Li Fei-Fei. ImageNet large scale visual recognition challenge. International Journal of Computer Vision, 115(3):211–252, 2015.
- Dorsa Sadigh, Anca D. Dragan, Shankar Sastry, and Sanjit A. Seshia. Active preference-based learning of reward functions. In Robotics, Science and Systems, 2017.
- Pierre Sermanet, Kelvin Xu, and Sergey Levine. Unsupervised perceptual rewards for imitation learning. Robotics, Science and Systems, 2017.
- Pratyusha Sharma, Lekha Mohan, Lerrel Pinto, and Abhinav Gupta. Multiple interactions made easy (MIME): Large scale demonstrations data for imitation. In Conference on Robot Learning, pages 906–915, 2018.
- David Silver, Guy Lever, Nicolas Heess, Thomas Degris, Daan Wierstra, and Martin Riedmiller. Deterministic policy gradient algorithms. In International Conference on Machine Learning, pages 387–395, 2014.
- David Silver, Aja Huang, Chris J Maddison, Arthur Guez, Laurent Sifre, George Van Den Driessche, Julian Schrittwieser, Ioannis Antonoglou, Veda Panneershelvam, Marc Lanctot, et al. Mastering the game of go with deep neural networks and tree search. Nature, 529(7587):484, 2016.
- Hal Stern. A continuum of paired comparison models. Biometrika, 77:265–273, 1990.
- Malcolm J. A. Strens and Andrew W. Moore. Policy search using paired comparisons. Journal of Machine Learning Research, 3: 921–950, 2003.
- LL Thurstone. A law of comparative judgement. Psychological Review, 34:273–286, 1927.
- Matej Vecerík, Todd Hester, Jonathan Scholz, Fumin Wang, Olivier Pietquin, Bilal Piot, Nicolas Heess, Thomas Rothörl, Thomas Lampe, and Martin Riedmiller. Leveraging demonstrations for deep reinforcement learning on robotics problems with sparse rewards. arXiv preprint arXiv:1707.08817, 2017.
- Mel Vecerik, Oleg Sushkov, David Barker, Thomas Rothörl, Todd Hester, and Jon Scholz. A practical approach to insertion with variable socket position using deep reinforcement learning. In IEEE International Conference on Robotics & Automation, pages 754–760, 2019.
- Oriol Vinyals, Igor Babuschkin, Junyoung Chung, Michael Mathieu, Max Jaderberg, Wojciech M Czarnecki, Andrew Dudzik, Aja Huang, Petko Georgiev, Richard Powell, et al. Alphastar: Mastering the real-time strategy game StarCraft II. DeepMind Blog, 2019.
- Paul J Werbos et al. Backpropagation through time: What it does and how to do it. Proceedings of the IEEE, 78(10):1550–1560, 1990.
- Christian Wirth, Riad Akrour, Gerhard Neumann, and Johannes Fürnkranz. A survey of preference-based reinforcement learning methods. Journal of Machine Learning Research, 18(136):1–46, 2017.
- Markus Wulfmeier, Abbas Abdolmaleki, Roland Hafner, Jost Tobias Springenberg, Michael Neunert, Tim Hertweck, Thomas Lampe, Noah Siegel, Nicolas Heess, and Martin Riedmiller. Regularized hierarchical policies for compositional transfer in robotics. arXiv preprint arXiv:1906.11228, 2019.
- Yuke Zhu, Ziyu Wang, Josh Merel, Andrei Rusu, Tom Erez, Serkan Cabi, Saran Tunyasuvunakool, János Kramár, Raia Hadsell, Nando de Freitas, et al. Reinforcement and imitation learning for diverse visuomotor skills. Robotics, Science and Systems, 2018.