Scaling data-driven robotics with reward sketching and batch reinforcement learning

Robotics: Science and Systems, 2020.

Keywords: real robot, speech recognition, behaviour cloning, deep reinforcement learning, human preference

Abstract:

By harnessing a growing dataset of robot experience, we learn control policies for a diverse and increasing set of related manipulation tasks. To make this possible, we introduce reward sketching: an effective way of eliciting human preferences to learn the reward function for a new task. This reward function is then used to retrospectively annotate all historical data with reward signals for the new task, which in turn allows policies to be trained with batch reinforcement learning.

Introduction
  • Deep learning has successfully advanced many areas of artificial intelligence, including vision [39, 26], speech recognition [24, 46, 4], natural language processing [17], and reinforcement learning (RL) [49, 63].
  • The success of deep learning in each of these fields was made possible by the availability of huge amounts of labeled training data.
  • In simulated environments such as video games, where experience and rewards are easy to obtain, deep RL has been tremendously successful, outperforming top human players by ingesting huge amounts of data [63, 69, 9].
  • The lack of large datasets with reward signals has limited the effectiveness of deep RL in robotics.
Highlights
  • Deep learning has successfully advanced many areas of artificial intelligence, including vision [39, 26], speech recognition [24, 46, 4], natural language processing [17], and reinforcement learning (RL) [49, 63]
  • Evaluation: While the reward and policy are learned from data, we cannot assess their ultimate quality without running the agent on the real robot.
  • As the agent is trained off-line, good performance on the real robot is a powerful indicator of generalization.
  • Its key components include a method for reward learning, retrospective reward labelling, and batch reinforcement learning with distributional value functions.
  • We found that reward sketching is an effective way to elicit reward functions, since humans are good at judging progress toward a goal (an illustrative sketch of such a reward model follows this list).
  • Diversity of training data seems to be an essential factor in the success of standard state-of-the-art reinforcement learning algorithms, which were previously reported to fail when trained only on expert data or on the history of a single agent [22].
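As noted above, reward sketching provides per-timestep human judgements of progress toward the goal. The snippet below is a minimal, illustrative sketch of a reward model fit to such annotations; the data format (per-frame targets in [0, 1]), the convolutional architecture, and the plain regression loss are assumptions for illustration only, not the paper's exact model or training objective.

```python
# Minimal sketch (assumptions only): a reward model mapping a camera frame to a
# scalar reward in [0, 1], trained by regressing onto human-sketched progress values.
import torch
import torch.nn as nn

class RewardModel(nn.Module):
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(3, 32, kernel_size=8, stride=4), nn.ReLU(),
            nn.Conv2d(32, 64, kernel_size=4, stride=2), nn.ReLU(),
            nn.Conv2d(64, 64, kernel_size=3, stride=1), nn.ReLU(),
            nn.Flatten(),
            nn.Linear(64 * 7 * 7, 256), nn.ReLU(),   # assumes 84x84 input frames
            nn.Linear(256, 1),
        )

    def forward(self, frames):                        # frames: (B, 3, 84, 84) in [0, 1]
        return torch.sigmoid(self.net(frames)).squeeze(-1)

model = RewardModel()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)

# One training step on dummy data standing in for frames sampled from sketched episodes.
frames = torch.rand(16, 3, 84, 84)    # batch of camera frames
sketched = torch.rand(16)             # human-sketched progress values in [0, 1]
loss = nn.functional.mse_loss(model(frames), sketched)
optimizer.zero_grad(); loss.backward(); optimizer.step()
```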
Methods
  • The general workflow is illustrated in Fig. 1 and a more detailed procedure is presented in Fig. 5.
  • A task-specific reward model allows the authors to retrospectively annotate data in NeverEnding Storage (NES) with reward signals for a new task (see the labelling sketch after this list).
  • The authors can train batch RL agents with all the data in NES.
  • The procedure for training an agent to complete a new task has the following steps, which are described in turn in the remainder of the section:
  • A. A human teleoperates the robot to provide first-person demonstrations of the target task.
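As referenced in the list above, once a reward model exists for the new task it can be applied to every episode already in storage. The sketch below illustrates that labelling pass; the episode format (a list of dicts holding frame arrays) is a hypothetical stand-in for NeverEnding Storage, and any per-frame reward model can be plugged in.

```python
# Minimal sketch (assumed storage format): retrospectively annotating previously
# logged episodes with rewards for a new task, using a learned per-frame reward model.
import torch

@torch.no_grad()
def annotate_episode(reward_model, episode):
    """Attach predicted per-timestep rewards for the new task to one stored episode."""
    frames = torch.as_tensor(episode["frames"], dtype=torch.float32)   # (T, 3, H, W)
    episode["predicted_reward"] = reward_model(frames).cpu().numpy()   # (T,)
    return episode

def annotate_storage(reward_model, storage):
    """Label every stored episode so batch RL can reuse all historical data."""
    return [annotate_episode(reward_model, ep) for ep in storage]

# Usage with a placeholder reward model and dummy episodes.
reward_model = lambda frames: torch.sigmoid(frames.mean(dim=(1, 2, 3)))   # stand-in
storage = [{"frames": torch.rand(50, 3, 84, 84).numpy()} for _ in range(3)]
storage = annotate_storage(reward_model, storage)
print(storage[0]["predicted_reward"].shape)   # (50,)
```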
Results
  • Existing RL approaches for real-world robotics mainly focus on tasks where hand-crafted reward mechanisms can be developed.
  • Simple behaviours such as learning to grasp objects [35] or learning to fly [23] by avoiding crashing can be acquired by reward engineering.
  • As the agent is trained off-line, good performance on the real robot is a powerful indicator of generalization.
  • To this end, the authors conducted controlled evaluations on the physical robot with fixed initial conditions across different policies.
  • The hard and unseen conditions are especially challenging, since they require the agent to cope with novel objects and novel object configurations.
Conclusion
  • The authors have proposed a new data-driven approach to robotics.
  • Its key components include a method for reward learning, retrospective reward labelling, and batch RL with distributional value functions (a worked sketch of the categorical return distribution used by such critics follows this list).
  • To further advance data-driven robotics, reward learning and batch RL, the authors release the large datasets [16] from NeverEnding Storage and canonical agents [28].
  • The authors' results across a wide set of tasks illustrate the versatility of the data-driven approach.
  • The learned agents showed a significant degree of generalization and robustness.
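As mentioned in the Conclusion, the agents use distributional value functions, i.e. critics that predict a distribution over returns rather than a single expected value, in the spirit of [8] and the D4PG critic [7]. The sketch below shows the core operation of a categorical distributional critic: projecting the Bellman-updated return distribution back onto a fixed support of atoms. It is a simplified illustration with dummy support bounds, omitting terminal handling, n-step returns, and the paper's actual hyper-parameters.

```python
# Minimal sketch of the categorical projection used by C51/D4PG-style critics.
# Support bounds, atom count, and batch contents below are dummy values.
import torch

def categorical_projection(next_probs, rewards, gamma, atoms):
    """Project r + gamma * Z(s', a') back onto the fixed support `atoms`."""
    v_min, v_max = atoms[0].item(), atoms[-1].item()
    delta_z = (v_max - v_min) / (len(atoms) - 1)
    # Shift and scale the atoms by the Bellman backup, clipping to the support.
    tz = (rewards.unsqueeze(-1) + gamma * atoms.unsqueeze(0)).clamp(v_min, v_max)
    b = (tz - v_min) / delta_z                        # fractional bin index, (B, N)
    lower, upper = b.floor().long(), b.ceil().long()
    projected = torch.zeros_like(next_probs)
    # Split each atom's probability mass between its two neighbouring bins.
    projected.scatter_add_(1, lower, next_probs * (upper.float() - b))
    projected.scatter_add_(1, upper, next_probs * (b - lower.float()))
    # When b lands exactly on a bin, both weights above are zero: keep the full mass.
    projected.scatter_add_(1, lower, next_probs * (upper == lower).float())
    return projected

# Usage: the critic is trained with a cross-entropy loss against this projected target.
atoms = torch.linspace(0.0, 100.0, 51)                    # return support
next_probs = torch.softmax(torch.randn(8, 51), dim=-1)    # Z(s', a') from a target net
rewards = torch.rand(8)
target = categorical_projection(next_probs, rewards, gamma=0.99, atoms=atoms)
print(target.sum(-1))                                     # each row sums to 1
```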
Tables
  • Table 1: Dataset statistics. The total includes off-task data not listed in individual rows; the teleoperation data and the tasks lift_green, stack_green_on_red, and lift_cloth partly overlap.
  • Table 2: The success rate of our agent and ablations for a given task in different difficulty settings. Recall that our agent is trained off-line.
Related work
  • RL has a long history in robotics [37, 53, 34, 25, 42, 43, 35]. However, applying RL in this domain inherits all the general difficulties of applying RL in the real world [18]. Most published works either rely on state estimation for a specific task, or operate in a very limited regime when learning from raw observations. These methods typically entail highly engineered reward functions. In our work, we go beyond the usual scale at which RL is applied to robotics, learning from raw observations and without predefined rewards.

    Batch RL trains policies from a fixed dataset and is thus particularly useful in real-world applications like robotics. It is currently an active area of research (see Lange et al. [41] for an overview), with a number of recent works aimed at improving its stability [22, 30, 1, 40].
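To make "training policies from a fixed dataset" concrete, the sketch below runs an actor-critic update that only ever samples logged, reward-annotated transitions and never queries the environment. For brevity it uses a scalar critic and omits target-network updates and the stabilisation techniques of [22, 30, 1, 40]; the paper's agent instead uses a distributional critic as sketched earlier. Network sizes and the dataset layout are illustrative assumptions.

```python
# Minimal sketch (assumptions only): offline actor-critic updates over a fixed
# dataset of logged (obs, action, reward, next_obs) transitions. No environment access.
import torch
import torch.nn as nn

OBS, ACT = 32, 7    # hypothetical observation-feature and action dimensions

actor = nn.Sequential(nn.Linear(OBS, 256), nn.ReLU(), nn.Linear(256, ACT), nn.Tanh())
critic = nn.Sequential(nn.Linear(OBS + ACT, 256), nn.ReLU(), nn.Linear(256, 1))
target_critic = nn.Sequential(nn.Linear(OBS + ACT, 256), nn.ReLU(), nn.Linear(256, 1))
target_critic.load_state_dict(critic.state_dict())   # Polyak updates omitted for brevity

actor_opt = torch.optim.Adam(actor.parameters(), lr=1e-4)
critic_opt = torch.optim.Adam(critic.parameters(), lr=1e-4)
GAMMA = 0.99

def batch_rl_step(obs, act, rew, next_obs):
    # Critic: one-step TD target computed from the target network and the current actor.
    with torch.no_grad():
        next_q = target_critic(torch.cat([next_obs, actor(next_obs)], dim=-1))
        target = rew.unsqueeze(-1) + GAMMA * next_q
    critic_loss = nn.functional.mse_loss(critic(torch.cat([obs, act], dim=-1)), target)
    critic_opt.zero_grad(); critic_loss.backward(); critic_opt.step()

    # Actor: deterministic policy gradient through the critic (only actor params step).
    actor_loss = -critic(torch.cat([obs, actor(obs)], dim=-1)).mean()
    actor_opt.zero_grad(); actor_loss.backward(); actor_opt.step()

# Usage on a dummy fixed dataset; in practice batches come from NES-style storage.
dataset = [(torch.randn(64, OBS), torch.rand(64, ACT) * 2 - 1,
            torch.rand(64), torch.randn(64, OBS)) for _ in range(10)]
for obs, act, rew, next_obs in dataset:
    batch_rl_step(obs, act, rew, next_obs)
```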
References
  • Rishabh Agarwal, Dale Schuurmans, and Mohammad Norouzi. Striving for simplicity in off-policy deep reinforcement learning. arXiv preprint arXiv:1907.04543, 2019.
  • Riad Akrour, Marc Schoenauer, and Michèle Sebag. APRIL: Active preference learning-based reinforcement learning. In ECMLPKDD, pages 116–131, 2012.
  • Riad Akrour, Marc Schoenauer, Michele Sebag, and JeanChristophe Souplet. Programming by feedback. In International Conference on Machine Learning, pages 1503–1511, 2014.
  • Dario Amodei, Sundaram Ananthanarayanan, Rishita Anubhai, Jingliang Bai, Eric Battenberg, Carl Case, Jared Casper, Bryan Catanzaro, Qiang Cheng, Guoliang Chen, et al. Deep speech 2: End-to-end speech recognition in English and Mandarin. In International Conference on Machine Learning, pages 173–182, 2016.
  • Jimmy Lei Ba, Jamie Ryan Kiros, and Geoffrey E Hinton. Layer normalization. arXiv preprint arXiv:1607.06450, 2016.
  • Nir Baram, Oron Anschel, Itai Caspi, and Shie Mannor. End-toend differentiable adversarial imitation learning. In International Conference on Machine Learning, pages 390–399, 2017.
  • Gabriel Barth-Maron, Matthew W. Hoffman, David Budden, Will Dabney, Dan Horgan, Dhruva TB, Alistair Muldal, Nicolas Heess, and Timothy Lillicrap. Distributed distributional deterministic policy gradients. In International Conference on Learning Representations, 2018.
  • Marc G Bellemare, Will Dabney, and Rémi Munos. A distributional perspective on reinforcement learning. In International Conference on Machine Learning, pages 449–458, 2017.
  • Christopher Berner, Greg Brockman, Brooke Chan, Vicki Cheung, Przemysław Debiak, Christy Dennison, David Farhi, Quirin Fischer, Shariq Hashme, Chris Hesse, et al. Dota 2 with large scale deep reinforcement learning. arXiv preprint arXiv:1912.06680, 2019.
  • Eric Brochu, Nando de Freitas, and Abhijeet Ghosh. Active preference learning with discrete choice data. In Advances on Neural Information Processing Systems, pages 409–416, 2007.
  • Eric Brochu, Tyson Brochu, and Nando de Freitas. A Bayesian interactive optimization approach to procedural animation design. In SIGGRAPH Symposium on Computer Animation, pages 103–112, 2010.
  • Daniel Brown, Wonjoon Goo, Prabhat Nagarajan, and Scott Niekum. Extrapolating beyond suboptimal demonstrations via inverse reinforcement learning from observations. In International Conference on Machine Learning, pages 783–792, 2019.
  • Paul F Christiano, Jan Leike, Tom Brown, Miljan Martic, Shane Legg, and Dario Amodei. Deep reinforcement learning from human preferences. In Advances on Neural Information Processing Systems, pages 4299–4307, 2017.
  • Wei Chu and Zoubin Ghahramani. Preference learning with Gaussian processes. In International Conference on Machine Learning, pages 137–144, 2005.
  • Sudeep Dasari, Frederik Ebert, Stephen Tian, Suraj Nair, Bernadette Bucher, Karl Schmeckpeper, Siddharth Singh, Sergey Levine, and Chelsea Finn. RoboNet: Large-scale multi-robot learning. In Conference on Robot Learning, 2019.
  • DeepMind. Sketchy data, 2020. URL https://github.com/deepmind/deepmind-research/tree/master/sketchy.
  • Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1, pages 4171–4186, 2019.
  • Gabriel Dulac-Arnold, Daniel J. Mankowitz, and Todd Hester. Challenges of real-world reinforcement learning. arXiv preprint arXiv:1904.12901, 2019.
  • Stephen E. Fienberg and Kinley Larntz. Log-linear representation for paired and multiple comparison models. Biometrika, 63(2):245–254, 1976.
  • Chelsea Finn, Sergey Levine, and Pieter Abbeel. Guided cost learning: Deep inverse optimal control via policy optimization. In International Conference on Machine Learning, pages 49–58, 2016.
  • Justin Fu, Katie Luo, and Sergey Levine. Learning robust rewards with adversarial inverse reinforcement learning. In International Conference on Learning Representations, 2018.
  • Scott Fujimoto, David Meger, and Doina Precup. Off-policy deep reinforcement learning without exploration. arXiv preprint arXiv:1812.02900, 2018.
  • Dhiraj Gandhi, Lerrel Pinto, and Abhinav Gupta. Learning to fly by crashing. In International Conference on Intelligent Robots and Systems, pages 3948–3955, 2017.
  • Alex Graves and Navdeep Jaitly. Towards end-to-end speech recognition with recurrent neural networks. In International Conference on Machine Learning, pages 1764–1772, 2014.
  • Roland Hafner and Martin Riedmiller. Reinforcement learning in feedback control. Machine learning, 84(1-2):137–169, 2011.
  • Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In IEEE Computer Vision and Pattern Recognition, pages 770–778, 2016.
  • Jonathan Ho and Stefano Ermon. Generative adversarial imitation learning. In Advances on Neural Information Processing Systems, pages 4565–4573, 2016.
  • Matt Hoffman, Bobak Shahriari, John Aslanides, Gabriel Barth-Maron, Feryal Behbahani, Tamara Norman, Abbas Abdolmaleki, Albin Cassirer, Fan Yang, Kate Baumli, Sarah Henderson, Alex Novikov, Sergio Gómez Colmenarejo, Serkan Cabi, Caglar Gulcehre, Tom Le Paine, Andrew Cowie, Ziyu Wang, Bilal Piot, and Nando de Freitas. Acme: A research framework for distributed reinforcement learning. arXiv preprint arXiv:2006.00979, 2020.
  • Borja Ibarz, Jan Leike, Tobias Pohlen, Geoffrey Irving, Shane Legg, and Dario Amodei. Reward learning from human preferences and demonstrations in Atari. In Advances on Neural Information Processing Systems, pages 8011–8023, 2018.
  • Natasha Jaques, Asma Ghandeharioun, Judy Hanwen Shen, Craig Ferguson, Agata Lapedriza, Noah Jones, Shixiang Gu, and Rosalind Picard. Way off-policy batch deep reinforcement learning of implicit human preferences in dialog. arXiv preprint arXiv:1907.00456, 2019.
  • Rae Jeong, Yusuf Aytar, David Khosid, Yuxiang Zhou, Jackie Kay, Thomas Lampe, Konstantinos Bousmalis, and Francesco Nori. Self-supervised sim-to-real adaptation for visual robotic manipulation. arXiv preprint arXiv:1910.09470, 2019.
  • Thorsten Joachims, Laura Granka, Bing Pan, Helene Hembrooke, Filip Radlinski, and Geri Gay. Evaluating the accuracy of implicit feedback from clicks and query reformulations in web search. ACM Transactions on Information Systems, 25(2), 2007.
  • Justin Johnson, Bharath Hariharan, Laurens van der Maaten, Li Fei-Fei, C Lawrence Zitnick, and Ross Girshick. CLEVR: A diagnostic dataset for compositional language and elementary visual reasoning. In IEEE Computer Vision and Pattern Recognition, 2017.
  • Mrinal Kalakrishnan, Ludovic Righetti, Peter Pastor, and Stefan Schaal. Learning force control policies for compliant manipulation. In International Conference on Intelligent Robots and Systems, pages 4639–4644, 2011.
  • Dmitry Kalashnikov, Alex Irpan, Peter Pastor, Julian Ibarz, Alexander Herzog, Eric Jang, Deirdre Quillen, Ethan Holly, Mrinal Kalakrishnan, Vincent Vanhoucke, and Sergey Levine. Scalable deep reinforcement learning for vision-based robotic manipulation. In Conference on Robot Learning, pages 651–673, 2018.
  • Steven Kapturowski, Georg Ostrovski, John Quan, Remi Munos, and Will Dabney. Recurrent experience replay in distributed reinforcement learning. In International Conference on Learning Representations, 2018.
  • Jens Kober, J Andrew Bagnell, and Jan Peters. Reinforcement learning in robotics: A survey. The International Journal of Robotics Research, 32(11):1238–1274, 2013.
  • Yuki Koyama, Issei Sato, Daisuke Sakamoto, and Takeo Igarashi. Sequential line search for efficient visual design optimization by crowds. ACM Transactions on Graphics, 36(4):1–11, 2017.
  • Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. Imagenet classification with deep convolutional neural networks. In Advances on Neural Information Processing Systems, pages 1097–1105, 2012.
  • Aviral Kumar, Justin Fu, Matthew Soh, George Tucker, and Sergey Levine. Stabilizing off-policy Q-learning via bootstrapping error reduction. In Advances on Neural Information Processing Systems, pages 11761–11771, 2019.
  • Sascha Lange, Thomas Gabel, and Martin Riedmiller. Batch reinforcement learning. In Reinforcement Learning: State-of-the-Art, pages 45–73. Springer, 2012.
  • Sergey Levine, Chelsea Finn, Trevor Darrell, and Pieter Abbeel. End-to-end training of deep visuomotor policies. The Journal of Machine Learning Research, 17(1):1334–1373, 2016.
  • Sergey Levine, Peter Pastor, Alex Krizhevsky, Julian Ibarz, and Deirdre Quillen. Learning hand-eye coordination for robotic grasping with deep learning and large-scale data collection. The International Journal of Robotics Research, 37(4-5):421–436, 2018.
  • Yunzhu Li, Jiaming Song, and Stefano Ermon. Infogail: Interpretable imitation learning from visual demonstrations. In Advances on Neural Information Processing Systems, pages 3812–3822, 2017.
  • Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C Lawrence Zitnick. Microsoft COCO: Common objects in context. In European Conference on Computer Vision, pages 740–755, 2014.
  • Andrew Maas, Ziang Xie, Dan Jurafsky, and Andrew Ng. Lexicon-free conversational speech recognition with neural networks. In Proceedings of the 2015 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 345–354, 2015.
  • Ajay Mandlekar, Yuke Zhu, Animesh Garg, Jonathan Booher, Max Spero, Albert Tung, Julian Gao, John Emmons, Anchit Gupta, Emre Orbay, Silvio Savarese, and Li Fei-Fei. RoboTurk: A crowdsourcing platform for robotic skill learning through imitation. In Conference on Robot Learning, pages 879–893, 2018.
  • Josh Merel, Yuval Tassa, Sriram Srinivasan, Jay Lemmon, Ziyu Wang, Greg Wayne, and Nicolas Heess. Learning human behaviors from motion capture by adversarial imitation. arXiv preprint arXiv:1707.02201, 2017.
  • Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Andrei A. Rusu, Joel Veness, Marc G. Bellemare, Alex Graves, Martin Riedmiller, Andreas K. Fidjeland, and Georg Ostrovski et al. Human-level control through deep reinforcement learning. Nature, 518(7540):529–533, 2015.
  • F Mosteller. Remarks on the method of paired comparisons: I. the least squares solution assuming equal standard deviations and equal correlations. Psychometrika, 16:3–9, 1951.
  • Ashvin Nair, Bob McGrew, Marcin Andrychowicz, Wojciech Zaremba, and Pieter Abbeel. Overcoming exploration in reinforcement learning with demonstrations. In IEEE International Conference on Robotics & Automation, pages 6292–6299, 2018.
  • Andrew Y. Ng and Stuart J. Russell. Algorithms for inverse reinforcement learning. In International Conference on Machine Learning, pages 663–670, 2000.
  • Jan Peters and Stefan Schaal. Reinforcement learning of motor skills with policy gradients. Neural networks, 21(4):682–697, 2008.
  • Tobias Pohlen, Bilal Piot, Todd Hester, Mohammad Gheshlaghi Azar, Dan Horgan, David Budden, Gabriel Barth-Maron, Hado Van Hasselt, John Quan, Mel Vecerík, et al. Observe and look further: Achieving consistent performance on Atari. arXiv preprint arXiv:1805.11593, 2018.
  • Dean A Pomerleau. ALVINN: An autonomous land vehicle in a neural network. In Advances on Neural Information Processing Systems, pages 305–313, 1989.
  • Rouhollah Rahmatizadeh, Pooya Abolghasemi, Ladislau Bölöni, and Sergey Levine. Vision-based multi-task manipulation for inexpensive robots using end-to-end learning from demonstration. In IEEE International Conference on Robotics & Automation, pages 3758–3765, 2018.
  • Aravind Rajeswaran, Vikash Kumar, Abhishek Gupta, Giulia Vezzani, John Schulman, Emanuel Todorov, and Sergey Levine. Learning complex dexterous manipulation with deep reinforcement learning and demonstrations. Robotics, Science and Systems, 2018.
  • Olga Russakovsky, Jia Deng, Hao Su, Jonathan Krause, Sanjeev Satheesh, Sean Ma, Zhiheng Huang, Andrej Karpathy, Aditya Khosla, Michael Bernstein, Alexander C. Berg, and Li Fei-Fei. ImageNet large scale visual recognition challenge. International Journal of Computer Vision, 115(3):211–252, 2015.
  • Dorsa Sadigh, Anca D. Dragan, Shankar Sastry, and Sanjit A. Seshia. Active preference-based learning of reward functions. In Robotics, Science and Systems, 2017.
  • Pierre Sermanet, Kelvin Xu, and Sergey Levine. Unsupervised perceptual rewards for imitation learning. Robotics, Science and Systems, 2017.
  • Pratyusha Sharma, Lekha Mohan, Lerrel Pinto, and Abhinav Gupta. Multiple interactions made easy (MIME): Large scale demonstrations data for imitation. In Conference on Robot Learning, pages 906–915, 2018.
  • David Silver, Guy Lever, Nicolas Heess, Thomas Degris, Daan Wierstra, and Martin Riedmiller. Deterministic policy gradient algorithms. In International Conference on Machine Learning, pages 387–395, 2014.
  • David Silver, Aja Huang, Chris J Maddison, Arthur Guez, Laurent Sifre, George Van Den Driessche, Julian Schrittwieser, Ioannis Antonoglou, Veda Panneershelvam, Marc Lanctot, et al. Mastering the game of go with deep neural networks and tree search. Nature, 529(7587):484, 2016.
  • Hal Stern. A continuum of paired comparison models. Biometrika, 77:265–273, 1990.
  • Malcolm J. A. Strens and Andrew W. Moore. Policy search using paired comparisons. Journal of Machine Learning Research, 3:921–950, 2003.
  • LL Thurstone. A law of comparative judgement. Psychological Review, 34:273–286, 1927.
  • Matej Vecerík, Todd Hester, Jonathan Scholz, Fumin Wang, Olivier Pietquin, Bilal Piot, Nicolas Heess, Thomas Rothörl, Thomas Lampe, and Martin Riedmiller. Leveraging demonstrations for deep reinforcement learning on robotics problems with sparse rewards. arXiv preprint arXiv:1707.08817, 2017.
  • Mel Vecerik, Oleg Sushkov, David Barker, Thomas Rothörl, Todd Hester, and Jon Scholz. A practical approach to insertion with variable socket position using deep reinforcement learning. In IEEE International Conference on Robotics & Automation, pages 754–760, 2019.
  • Oriol Vinyals, Igor Babuschkin, Junyoung Chung, Michael Mathieu, Max Jaderberg, Wojciech M Czarnecki, Andrew Dudzik, Aja Huang, Petko Georgiev, Richard Powell, et al. AlphaStar: Mastering the real-time strategy game StarCraft II. DeepMind Blog, 2019.
  • Paul J Werbos et al. Backpropagation through time: What it does and how to do it. Proceedings of the IEEE, 78(10):1550–1560, 1990.
  • Christian Wirth, Riad Akrour, Gerhard Neumann, and Johannes Fürnkranz. A survey of preference-based reinforcement learning methods. Journal of Machine Learning Research, 18(136):1–46, 2017.
  • Markus Wulfmeier, Abbas Abdolmaleki, Roland Hafner, Jost Tobias Springenberg, Michael Neunert, Tim Hertweck, Thomas Lampe, Noah Siegel, Nicolas Heess, and Martin Riedmiller. Regularized hierarchical policies for compositional transfer in robotics. arXiv preprint arXiv:1906.11228, 2019.
  • Yuke Zhu, Ziyu Wang, Josh Merel, Andrei Rusu, Tom Erez, Serkan Cabi, Saran Tunyasuvunakool, János Kramár, Raia Hadsell, Nando de Freitas, et al. Reinforcement and imitation learning for diverse visuomotor skills. Robotics, Science and Systems, 2018.