Efficient Deep Reinforcement Learning through Policy Transfer

Keywords:
relevant task, appropriate source, prior knowledge, Context-Aware Policy reuSe, Policy Gradient
TL;DR:
We propose a novel Policy Transfer Framework (PTF) to accelerate Reinforcement Learning by reusing knowledge from previously learned policies of relevant tasks.

Abstract:

Transfer Learning (TL) has shown great potential to accelerate Reinforcement Learning (RL) by leveraging prior knowledge from past learned policies of relevant tasks. Existing transfer approaches either explicitly compute the similarity between tasks or select appropriate source policies to provide guided exploration for the target task…

Introduction
  • Recent advances in Deep Reinforcement Learning (DRL) have achieved impressive success, reaching human-level control in complex tasks [15, 19].
  • One major direction of transfer in RL has focused on measuring the similarity between two tasks, either by mapping the state spaces between the tasks [3, 30] or by computing the similarity of two Markov Decision Processes (MDPs) [25], and then transferring value functions directly according to these similarities.
Highlights
  • The main contributions of our work are: 1) the Policy Transfer Framework (PTF) learns when and which source policy is best to reuse for the target policy, and when to terminate it, by modelling multi-policy transfer as an option learning problem (a minimal Python sketch of this option-style selection and termination is given after this list); 2) we propose an adaptive and heuristic mechanism to ensure the efficient reuse of source policies and avoid negative transfer; and 3) both existing value-based and policy-based DRL approaches can be incorporated, and experimental results show PTF significantly boosts the performance of existing DRL approaches and outperforms state-of-the-art policy transfer methods in both discrete and continuous action spaces.
  • This paper focuses on standard RL tasks. Formally, a task can be specified by a Markov Decision Process (MDP), described as a tuple ⟨S, A, T, R⟩, where S is the set of states; A is the set of actions; T : S × A × S → [0, 1] is the state transition function; and R is the reward function (the standard learning objective under this formalism is restated after this list).
  • We evaluate PTF on three test domains, grid world [4], pinball [11] and reacher [28], compared with several DRL methods learning from scratch (A3C [17] and PPO [24]) and with the state-of-the-art policy transfer method Context-Aware Policy reuSe (CAPS) [13], implemented as a deep version (Deep-CAPS).
  • We propose a Policy Transfer Framework (PTF) which can efficiently select the optimal source policy and exploit its useful information to facilitate learning on the target task.
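
For completeness, here is a brief restatement of the objective under the MDP formalism in the bullets above (our addition; the discount factor γ and step size α are standard assumptions not shown in the truncated definition). The agent seeks a policy π maximising the expected discounted return, and value-based methods such as Q-learning estimate action values by temporal-difference updates:

    \[
    J(\pi) = \mathbb{E}_{\pi,\,T}\!\left[\sum_{t=0}^{\infty} \gamma^{t} R(s_t, a_t)\right],
    \qquad
    Q(s_t, a_t) \leftarrow Q(s_t, a_t) + \alpha \big[ r_t + \gamma \max_{a'} Q(s_{t+1}, a') - Q(s_t, a_t) \big].
    \]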
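
The following is a minimal Python sketch of the option-style "when and which source policy to reuse, and when to terminate it" logic described in the contributions above. It is an illustrative assumption of ours, not the authors' implementation; all names (PolicyTransferWrapper, termination_fn, reuse_prob, ...) are hypothetical.

    import numpy as np

    class PolicyTransferWrapper:
        """Hypothetical sketch of option-style source-policy reuse (not the paper's code).

        Each source policy is treated as an option: the agent picks one, follows it
        until a termination probability says to stop, and otherwise acts with (and
        keeps training) its own target policy.
        """

        def __init__(self, source_policies, target_policy, termination_fn,
                     reuse_prob=0.9, reuse_decay=0.999):
            self.source_policies = source_policies  # list of callables: state -> action
            self.target_policy = target_policy      # callable: state -> action
            self.termination_fn = termination_fn    # callable: (option_id, state) -> prob in [0, 1]
            self.option_values = np.zeros(len(source_policies))  # estimated return of reusing each option
            self.reuse_prob = reuse_prob            # probability of exploiting a source policy
            self.reuse_decay = reuse_decay          # annealed so transfer fades out over time
            self.current_option = None

        def _select_option(self, eps=0.1):
            # Epsilon-greedy choice of which source policy to reuse next.
            if np.random.rand() < eps:
                return np.random.randint(len(self.source_policies))
            return int(np.argmax(self.option_values))

        def act(self, state):
            # Possibly terminate the currently reused source policy.
            if self.current_option is not None and \
                    np.random.rand() < self.termination_fn(self.current_option, state):
                self.current_option = None
            # Possibly start reusing a (new) source policy.
            if self.current_option is None and np.random.rand() < self.reuse_prob:
                self.current_option = self._select_option()
            self.reuse_prob *= self.reuse_decay
            if self.current_option is not None:
                return self.source_policies[self.current_option](state)
            return self.target_policy(state)

        def update_option_value(self, option_id, episode_return, lr=0.1):
            # Running estimate of how useful each source policy has been on the target task.
            self.option_values[option_id] += lr * (episode_return - self.option_values[option_id])

The sketch only covers the selection and termination bookkeeping; how the reused policy guides the target policy's updates (e.g. inside A3C or PPO), and the adaptive mechanism of contribution 2), are not modelled here.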
Results
  • The authors evaluate PTF on three test domains, grid world [4], pinball [11] and reacher [28], compared with several DRL methods learning from scratch (A3C [17] and PPO [24]) and with the state-of-the-art policy transfer method CAPS [13], implemented as a deep version (Deep-CAPS).
  • Results are averaged over 20 random seeds.
  • [Figure panels: (a) Grid world W, (b) Grid world W′]
Conclusion
  • The authors propose a Policy Transfer Framework (PTF) which can efficiently select the optimal source policy and exploit its useful information to facilitate learning on the target task.
  • PTF efficiently avoids negative transfer by terminating the exploitation of the current source policy and adaptively selecting another one.
  • It is worthwhile investigating how to extend PTF to multiagent settings.
  • Another interesting direction is how to learn abstract knowledge for fast adaptation in new environments.
Tables
  • Table1: CAPS Hyperparameters
  • Table2: A3C Hyperparameters
  • Table3: PPO Hyperparameters
  • Table4: PTF Hyperparameters
Related work
  • Recently, transfer has become an important direction in RL, and a wide variety of methods have been studied in the context of RL transfer learning [29]. Brys et al. [3] applied a reward-shaping approach to policy transfer, benefiting from the theoretical guarantees of reward shaping (a minimal sketch of such a shaping term is given below); however, it may suffer from negative transfer. Song et al. [25] transferred the action-value functions of the source tasks to the target task according to a task similarity metric that computes the distance between tasks; however, they assumed a well-estimated model, which is not always available in practice. Later, Laroche and Barlier [12] reused the experience instances of a source task to estimate the reward function of the target task. The limitation of this approach resides in the restrictive assumption that all tasks share the same transition dynamics and differ only in the reward function.
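
As referenced above, here is a minimal sketch of the reward-shaping route to policy transfer in the spirit of [3] (our illustrative assumption, not their exact formulation): a potential function phi derived from source-task knowledge, e.g. a value estimate, is turned into a potential-based shaping term added to the environment reward.

    def shaped_reward(reward, state, next_state, potential_fn, gamma=0.99, done=False):
        """Potential-based shaping: F(s, s') = gamma * phi(s') - phi(s).

        phi (potential_fn) is assumed to encode source-task knowledge, e.g. a value
        estimate learned on the source task (a hypothetical choice for illustration).
        Potential-based shaping leaves the optimal policy of the target task unchanged,
        which is the theoretical guarantee mentioned above, although a poorly matched
        potential can still slow learning in practice.
        """
        next_potential = 0.0 if done else potential_fn(next_state)
        return reward + gamma * next_potential - potential_fn(state)

    # Hypothetical usage with a tabular value estimate from the source task:
    # potential_fn = lambda s: source_value_table.get(s, 0.0)
    # r_shaped = shaped_reward(r, s, s_next, potential_fn, gamma=0.99, done=done)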
Funding
  • The work is supported by the National Natural Science Foundation of China (Grant Nos.: 61702362, U1836214)
References
  • [1] Pierre-Luc Bacon, Jean Harb, and Doina Precup. 2017. The Option-Critic Architecture. In Proceedings of the AAAI Conference on Artificial Intelligence. 1726–1734.
  • [2] Emma Brunskill and Lihong Li. 2014. PAC-inspired option discovery in lifelong reinforcement learning. In Proceedings of the International Conference on Machine Learning. 316–324.
  • [3] Tim Brys, Anna Harutyunyan, Matthew E. Taylor, and Ann Nowé. 2015. Policy transfer using reward shaping. In Proceedings of the International Conference on Autonomous Agents and Multiagent Systems. 181–188.
  • [4] Fernando Fernández and Manuela Veloso. 2006. Probabilistic Policy Reuse in a Reinforcement Learning Agent. In Proceedings of the International Conference on Autonomous Agents and Multiagent Systems. 720–727.
  • [5] Jean Harb, Pierre-Luc Bacon, Martin Klissarov, and Doina Precup. 2018. When Waiting Is Not an Option: Learning Options With a Deliberation Cost. In Proceedings of the Thirty-Second AAAI Conference on Artificial Intelligence. 3165–3172.
  • [6] Anna Harutyunyan, Will Dabney, Diana Borsa, Nicolas Heess, Rémi Munos, and Doina Precup. 2019. The Termination Critic. In Proceedings of the 22nd International Conference on Artificial Intelligence and Statistics. 2231–2240.
  • [7] Karol Hausman, Yevgen Chebotar, Stefan Schaal, Gaurav S. Sukhatme, and Joseph J. Lim. 2017. Multi-Modal Imitation Learning from Unstructured Demonstrations using Generative Adversarial Nets. In Advances in Neural Information Processing Systems. 1235–1245.
  • [8] Max Jaderberg, Valentin Dalibard, Simon Osindero, Wojciech M. Czarnecki, Jeff Donahue, Ali Razavi, Oriol Vinyals, Tim Green, Iain Dunning, Karen Simonyan, Chrisantha Fernando, and Koray Kavukcuoglu. 2017. Population Based Training of Neural Networks. CoRR abs/1711.09846 (2017).
  • [9] Thomas Kipf, Yujia Li, Hanjun Dai, Vinícius Flores Zambaldi, Alvaro Sanchez-Gonzalez, Edward Grefenstette, Pushmeet Kohli, and Peter W. Battaglia. 2019. CompILE: Compositional Imitation Learning and Execution. In Proceedings of the 36th International Conference on Machine Learning. 3418–3428.
  • [10] Martin Klissarov, Pierre-Luc Bacon, Jean Harb, and Doina Precup. 2017. Learnings Options End-to-End for Continuous Action Tasks. CoRR abs/1712.00004 (2017).
  • [11] George Konidaris and Andrew G. Barto. 2009. Skill discovery in continuous reinforcement learning domains using skill chaining. In Advances in Neural Information Processing Systems. 1015–1023.
  • [12] Romain Laroche and Merwan Barlier. 2017. Transfer Reinforcement Learning with Shared Dynamics. In Proceedings of the AAAI Conference on Artificial Intelligence. 2147–2153.
  • [13] Siyuan Li, Fangda Gu, Guangxiang Zhu, and Chongjie Zhang. 2019. Context-Aware Policy Reuse. In Proceedings of the 18th International Conference on Autonomous Agents and Multiagent Systems. 989–997.
  • [14] Siyuan Li and Chongjie Zhang. 2018. An Optimal Online Method of Selecting Source Policies for Reinforcement Learning. In Proceedings of the AAAI Conference on Artificial Intelligence. 3562–3570.
  • [15] Timothy P. Lillicrap, Jonathan J. Hunt, Alexander Pritzel, Nicolas Heess, Tom Erez, Yuval Tassa, David Silver, and Daan Wierstra. 2016. Continuous control with deep reinforcement learning. In Proceedings of the International Conference on Learning Representations.
  • [16] Timothy Mann, Daniel Mankowitz, and Shie Mannor. 2014. Time-regularized interrupting options (TRIO). In Proceedings of the International Conference on Machine Learning. 1350–1358.
  • [17] Volodymyr Mnih, Adria Puigdomenech Badia, Mehdi Mirza, Alex Graves, Timothy Lillicrap, Tim Harley, David Silver, and Koray Kavukcuoglu. 2016. Asynchronous methods for deep reinforcement learning. In Proceedings of the International Conference on Machine Learning. 1928–1937.
  • [18] Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Alex Graves, Ioannis Antonoglou, Daan Wierstra, and Martin A. Riedmiller. 2013. Playing Atari with Deep Reinforcement Learning. CoRR abs/1312.5602 (2013).
  • [19] Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Andrei A. Rusu, Joel Veness, Marc G. Bellemare, Alex Graves, Martin A. Riedmiller, Andreas Fidjeland, Georg Ostrovski, Stig Petersen, Charles Beattie, Amir Sadik, Ioannis Antonoglou, Helen King, Dharshan Kumaran, Daan Wierstra, Shane Legg, and Demis Hassabis. 2015. Human-level control through deep reinforcement learning. Nature 518, 7540 (2015), 529.
  • [20] Janarthanan Rajendran, Aravind S. Lakshminarayanan, Mitesh M. Khapra, P. Prasanna, and Balaraman Ravindran. 2017. Attend, Adapt and Transfer: Attentive Deep Architecture for Adaptive Transfer from Multiple Sources in the Same Domain. In Proceedings of the International Conference on Learning Representations.
  • [21] Andrei A. Rusu, Sergio Gomez Colmenarejo, Caglar Gulcehre, Guillaume Desjardins, James Kirkpatrick, Razvan Pascanu, Volodymyr Mnih, Koray Kavukcuoglu, and Raia Hadsell. 2016. Policy Distillation. In Proceedings of the International Conference on Learning Representations.
  • [22] Himanshu Sahni, Saurabh Kumar, Farhan Tejani, and Charles L. Isbell Jr. 2017. Learning to Compose Skills. CoRR abs/1711.11289 (2017).
  • [23] Simon Schmitt, Jonathan J. Hudson, Augustin Zídek, Simon Osindero, Carl Doersch, Wojciech M. Czarnecki, Joel Z. Leibo, Heinrich Küttler, Andrew Zisserman, Karen Simonyan, and S. M. Ali Eslami. 2018. Kickstarting Deep Reinforcement Learning. arXiv preprint arXiv:1803.03835 (2018).
  • [24] John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. 2017. Proximal Policy Optimization Algorithms. arXiv preprint arXiv:1707.06347 (2017).
  • [25] Jinhua Song, Yang Gao, Hao Wang, and Bo An. 2016. Measuring the distance between finite Markov decision processes. In Proceedings of the International Conference on Autonomous Agents and Multiagent Systems. 468–476.
  • [26] Richard S. Sutton and Andrew G. Barto. 1998. Reinforcement Learning: An Introduction. MIT Press.
  • [27] Richard S. Sutton, Doina Precup, and Satinder Singh. 1999. Between MDPs and semi-MDPs: A framework for temporal abstraction in reinforcement learning. Artificial Intelligence 112, 1 (1999), 181–211.
  • [28] Yuval Tassa, Yotam Doron, Alistair Muldal, Tom Erez, Yazhe Li, Diego de Las Casas, David Budden, Abbas Abdolmaleki, Josh Merel, Andrew Lefrancq, Timothy P. Lillicrap, and Martin A. Riedmiller. 2018. DeepMind Control Suite. CoRR abs/1801.00690 (2018).
  • [29] Matthew E. Taylor and Peter Stone. 2009. Transfer learning for reinforcement learning domains: A survey. Journal of Machine Learning Research 10 (2009), 1633–1685.
  • [30] Matthew E. Taylor, Peter Stone, and Yaxin Liu. 2007. Transfer learning via inter-task mappings for temporal difference learning. Journal of Machine Learning Research 8 (2007), 2125–2167.
  • [31] Philip Thomas. 2014. Bias in Natural Actor-Critic Algorithms. In Proceedings of the International Conference on Machine Learning.
  • [32] Emanuel Todorov, Tom Erez, and Yuval Tassa. 2012. MuJoCo: A physics engine for model-based control. In Proceedings of the International Conference on Intelligent Robots and Systems. 5026–5033.
  • [33] Christopher J. C. H. Watkins and Peter Dayan. 1992. Q-learning. Machine Learning 8, 3–4 (1992), 279–292.
  • [34] Ronald J. Williams. 1992. Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine Learning 8, 3–4 (1992), 229–256.
  • [35] Haiyan Yin and Sinno Jialin Pan. 2017. Knowledge Transfer for Deep Reinforcement Learning with Hierarchical Experience Replay. In Proceedings of the AAAI Conference on Artificial Intelligence. 1640–1646.