Efficient Deep Reinforcement Learning via Adaptive Policy Transfer

IJCAI, pp. 3094-3100, 2020.

Keywords:
Probably Approximately Correct, Policy Gradient, Machine Learning: Reinforcement Learning, action space, transfer learning

Abstract:

Transfer learning has shown great potential to accelerate Reinforcement Learning (RL) by leveraging prior knowledge from past learned policies of relevant tasks. Existing approaches either transfer previous knowledge by explicitly computing similarities between tasks or select appropriate source policies to provide guided explorations.

Introduction
  • Recent advances in Deep Reinforcement Learning (DRL) have achieved impressive success, attaining human-level control in complex tasks [Mnih et al., 2015; Lillicrap et al., 2016].
  • One major direction of transfer in RL focuses on measuring the similarity between two tasks, either by mapping their state spaces [Brys et al., 2015] or by computing the similarity between two Markov Decision Processes (MDPs) [Song et al., 2016], and then transferring value functions directly according to that similarity.
  • Another direction of policy transfer focuses on selecting a suitable source policy for exploration, using a probabilistic exploration strategy [Fernandez and Veloso, 2006] or multi-armed bandit methods [Li and Zhang, 2018].
  • Using the Q-function, the policy gradient can be written as $\nabla_\theta J(\theta) = \mathbb{E}_{s \sim \rho^{\pi},\, a \sim \pi_\theta}\left[\nabla_\theta \log \pi_\theta(a|s)\, Q^{\pi}(s,a)\right]$; a minimal gradient-step sketch follows this list.
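To make the gradient above concrete, here is a minimal sketch (not the authors' implementation) of a single policy-gradient step in PyTorch, where the log-probability of a sampled action is weighted by a Q-value estimate. The network architecture, state dimension, and placeholder Q-value are illustrative assumptions.

```python
import torch
import torch.nn as nn

class PolicyNet(nn.Module):
    def __init__(self, state_dim, n_actions):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, 64), nn.ReLU(),
            nn.Linear(64, n_actions),
        )

    def forward(self, state):
        # categorical policy pi_theta(a|s) over discrete actions
        return torch.distributions.Categorical(logits=self.net(state))

policy = PolicyNet(state_dim=4, n_actions=4)
optimizer = torch.optim.Adam(policy.parameters(), lr=1e-3)

state = torch.randn(1, 4)        # placeholder state batch of size 1
dist = policy(state)
action = dist.sample()
q_value = torch.tensor([1.0])    # placeholder estimate of Q^pi(s, a)

# Ascend grad log pi(a|s) * Q(s, a) by descending its negative.
loss = -(dist.log_prob(action) * q_value).mean()
optimizer.zero_grad()
loss.backward()
optimizer.step()
```

In practice the Q-value would come from a learned critic (as in A3C) or from sampled returns rather than a constant.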
Highlights
  • Recent advances in Deep Reinforcement Learning (DRL) have achieved impressive success, attaining human-level control in complex tasks [Mnih et al., 2015; Lillicrap et al., 2016]
  • Deep Reinforcement Learning still faces sample-inefficiency problems, especially when the state-action space becomes large, which makes learning from scratch difficult
  • One major direction of transfer in Reinforcement Learning focuses on measuring the similarity between two tasks, either by mapping their state spaces [Brys et al., 2015] or by computing the similarity between two Markov Decision Processes (MDPs) [Song et al., 2016], and then transferring value functions directly according to that similarity
  • The main contributions of our work are: 1) the Policy Transfer Framework learns when and which source policy is best to reuse for the target policy, and when to terminate it, by modeling multi-policy transfer as an option learning problem; 2) we propose an adaptive and heuristic mechanism to ensure the efficient reuse of source policies and to avoid negative transfer; and 3) both existing value-based and policy-based Deep Reinforcement Learning approaches can be incorporated, and experimental results show that the Policy Transfer Framework significantly boosts the performance of existing Deep Reinforcement Learning approaches and outperforms state-of-the-art policy transfer methods in both discrete and continuous action spaces (a minimal sketch of this option-style reuse loop follows this list)
  • We evaluate the Policy Transfer Framework on three domains, grid world [Li et al., 2019], pinball [Bacon et al., 2017] and reacher [Tassa et al., 2018], compared with several Deep Reinforcement Learning methods learning from scratch (A3C [Mnih et al., 2016] and PPO [Schulman et al., 2017]) and the state-of-the-art policy transfer method Context-Aware Policy reuSe (CAPS) [Li et al., 2019], implemented as a deep version (Deep-CAPS)
  • We propose a Policy Transfer Framework (PTF) which can efficiently select the optimal source policy and exploit its useful information to facilitate learning on the target task
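As a rough illustration of the option-style reuse idea in the contributions above, the sketch below keeps one value per source policy, selects which policy to reuse epsilon-greedily, and stops reusing it according to a termination probability. All class and function names, the epsilon-greedy rule, and the update rule are assumptions for illustration; the paper's actual mechanism is learned jointly with the target policy.

```python
import random

class SourcePolicySelector:
    """Keeps one option value per source policy; higher value = more useful to reuse."""

    def __init__(self, n_source_policies, epsilon=0.1, lr=0.1):
        self.q = [0.0] * n_source_policies
        self.epsilon = epsilon
        self.lr = lr

    def select(self):
        # epsilon-greedy choice of which source policy to reuse next
        if random.random() < self.epsilon:
            return random.randrange(len(self.q))
        return max(range(len(self.q)), key=lambda i: self.q[i])

    def update(self, idx, option_return):
        # move the option value toward the return observed while reusing policy idx
        self.q[idx] += self.lr * (option_return - self.q[idx])

def should_terminate(termination_prob):
    # a learned termination function beta(s) would supply this probability
    return random.random() < termination_prob

# Usage: pick a source policy, follow its guidance until termination fires or the
# episode ends, then credit the selector with the return collected meanwhile.
selector = SourcePolicySelector(n_source_policies=4)
chosen = selector.select()
if should_terminate(termination_prob=0.2):
    selector.update(chosen, option_return=1.0)  # placeholder return
```

Terminating reuse when the current source policy stops helping, and adaptively switching to another, is how the framework is described as avoiding negative transfer.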
Results
  • The authors evaluate PTF on three domains, grid world [Li et al., 2019], pinball [Bacon et al., 2017] and reacher [Tassa et al., 2018], compared with several DRL methods learning from scratch (A3C [Mnih et al., 2016] and PPO [Schulman et al., 2017]) and the state-of-the-art policy transfer method CAPS [Li et al., 2019], implemented as a deep version (Deep-CAPS).
  • 4.1 Grid World.
  • In a grid world W, the agent starts from any of the grids and chooses one of four actions: up, down, left and right (a minimal environment sketch follows this list).
  • Each action moves the agent one step in the corresponding direction.
  • G1, G2, G3 and G4 denote the goals of the source tasks, while g and g′ denote the goals of the target tasks.
  • One of the target tasks is similar to the source task G1, since their goals are within a close distance of each other.
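For reference, a minimal grid-world environment consistent with the description above might look like the sketch below; the grid size, goal location, and reward values are illustrative assumptions rather than the paper's exact settings.

```python
import random

class GridWorld:
    """Agent starts in a random cell; four one-step actions; episode ends at the goal."""

    ACTIONS = {"up": (0, -1), "down": (0, 1), "left": (-1, 0), "right": (1, 0)}

    def __init__(self, width=10, height=10, goal=(9, 9)):
        self.width, self.height = width, height
        self.goal = goal
        self.pos = None

    def reset(self):
        # the agent may start from any of the grids
        self.pos = (random.randrange(self.width), random.randrange(self.height))
        return self.pos

    def step(self, action):
        dx, dy = self.ACTIONS[action]
        x = min(max(self.pos[0] + dx, 0), self.width - 1)
        y = min(max(self.pos[1] + dy, 0), self.height - 1)
        self.pos = (x, y)
        done = self.pos == self.goal
        reward = 1.0 if done else 0.0   # illustrative reward; the paper's may differ
        return self.pos, reward, done
```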
Conclusion
  • The authors propose a Policy Transfer Framework (PTF) which can efficiently select the optimal source policy and exploit its useful information to facilitate learning on the target task.
  • PTF efficiently avoids negative transfer by terminating the exploitation of the current source policy and adaptively selecting another one.
  • PTF can be combined with existing DRL methods.
  • Experimental results show that PTF efficiently accelerates the learning process of existing state-of-the-art DRL methods and outperforms previous policy reuse approaches.
  • It is worthwhile to investigate how to extend PTF to multiagent settings, and how to learn abstract knowledge for fast adaptation in new environments.
Funding
  • The work is supported by the National Natural Science Foundation of China (Grant Nos. 61702362, U1836214, 61876119), the New Generation of Artificial Intelligence Science and Technology Major Project of Tianjin under Grant No. 19ZXZNGX00010, and the Natural Science Foundation of Jiangsu under Grant No. BK20181432.
Reference
  • [Bacon et al., 2017] Pierre-Luc Bacon, Jean Harb, and Doina Precup. The option-critic architecture. In Proceedings of the 31st AAAI Conference on Artificial Intelligence, pages 1726–1734, 2017.
  • [Brunskill and Li, 2014] Emma Brunskill and Lihong Li. PAC-inspired option discovery in lifelong reinforcement learning. In Proceedings of International Conference on Machine Learning, pages 316–324, 2014.
  • [Brys et al., 2015] Tim Brys, Anna Harutyunyan, Matthew E Taylor, and Ann Nowe. Policy transfer using reward shaping. In Proceedings of International Conference on Autonomous Agents and Multiagent Systems, pages 181–188, 2015.
  • [Fernandez and Veloso, 2006] Fernando Fernandez and Manuela Veloso. Probabilistic policy reuse in a reinforcement learning agent. In Proceedings of International Conference on Autonomous Agents and Multiagent Systems, pages 720–727, 2006.
  • [Jaderberg et al., 2017] Max Jaderberg, Valentin Dalibard, Simon Osindero, Wojciech M. Czarnecki, Jeff Donahue, Ali Razavi, Oriol Vinyals, Tim Green, Iain Dunning, Karen Simonyan, Chrisantha Fernando, and Koray Kavukcuoglu. Population based training of neural networks. CoRR, abs/1711.09846, 2017.
  • [Laroche and Barlier, 2017] Romain Laroche and Merwan Barlier. Transfer reinforcement learning with shared dynamics. In Proceedings of the 31st AAAI Conference on Artificial Intelligence, pages 2147–2153, 2017.
  • [Li and Zhang, 2018] Siyuan Li and Chongjie Zhang. An optimal online method of selecting source policies for reinforcement learning. In Proceedings of AAAI Conference on Artificial Intelligence, pages 3562–3570, 2018.
  • [Li et al., 2019] Siyuan Li, Fangda Gu, Guangxiang Zhu, and Chongjie Zhang. Context-aware policy reuse. In Proceedings of the 18th International Conference on Autonomous Agents and Multiagent Systems, pages 989–997, 2019.
  • [Lillicrap et al., 2016] Timothy P Lillicrap, Jonathan J Hunt, Alexander Pritzel, Nicolas Heess, Tom Erez, Yuval Tassa, David Silver, and Daan Wierstra. Continuous control with deep reinforcement learning. In Proceedings of International Conference on Learning Representations, 2016.
  • [Mann et al., 2014] Timothy Mann, Daniel Mankowitz, and Shie Mannor. Time-regularized interrupting options. In Proceedings of International Conference on Machine Learning, pages 1350–1358, 2014.
  • [Mnih et al., 2015] Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Andrei A. Rusu, Joel Veness, Marc G. Bellemare, Alex Graves, Martin A. Riedmiller, Andreas Fidjeland, Georg Ostrovski, Stig Petersen, Charles Beattie, Amir Sadik, Ioannis Antonoglou, Helen King, Dharshan Kumaran, Daan Wierstra, Shane Legg, and Demis Hassabis. Human-level control through deep reinforcement learning. Nature, 518(7540):529, 2015.
  • [Mnih et al., 2016] Volodymyr Mnih, Adria Puigdomenech Badia, Mehdi Mirza, Alex Graves, Timothy Lillicrap, Tim Harley, David Silver, and Koray Kavukcuoglu. Asynchronous methods for deep reinforcement learning. In Proceedings of International Conference on Machine Learning, pages 1928–1937, 2016.
  • [Rajendran et al., 2017] Janarthanan Rajendran, Aravind S Lakshminarayanan, Mitesh M Khapra, P Prasanna, and Balaraman Ravindran. Attend, adapt and transfer: Attentive deep architecture for adaptive transfer from multiple sources in the same domain. In Proceedings of International Conference on Learning Representations, 2017.
  • [Rusu et al., 2016] Andrei A Rusu, Sergio Gomez Colmenarejo, Caglar Gulcehre, Guillaume Desjardins, James Kirkpatrick, Razvan Pascanu, Volodymyr Mnih, Koray Kavukcuoglu, and Raia Hadsell. Policy distillation. In Proceedings of International Conference on Learning Representations, 2016.
  • [Schmitt et al., 2018] Simon Schmitt, Jonathan J. Hudson, Augustin Žídek, Simon Osindero, Carl Doersch, Wojciech M. Czarnecki, Joel Z. Leibo, Heinrich Küttler, Andrew Zisserman, Karen Simonyan, and S. M. Ali Eslami. Kickstarting deep reinforcement learning. arXiv preprint arXiv:1803.03835, 2018.
  • [Schulman et al., 2017] John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347, 2017.
  • [Song et al., 2016] Jinhua Song, Yang Gao, Hao Wang, and Bo An. Measuring the distance between finite Markov decision processes. In Proceedings of International Conference on Autonomous Agents and Multiagent Systems, pages 468–476, 2016.
  • [Sutton and Barto, 1998] Richard S Sutton and Andrew G Barto. Reinforcement learning: An introduction. MIT press, 1998.
  • [Sutton et al., 1999] Richard S. Sutton, Doina Precup, and Satinder Singh. Between MDPs and semi-MDPs: A framework for temporal abstraction in reinforcement learning. Artificial Intelligence, 112(1):181 – 211, 1999.
  • [Tassa et al., 2018] Yuval Tassa, Yotam Doron, Alistair Muldal, Tom Erez, Yazhe Li, Diego de Las Casas, David Budden, Abbas Abdolmaleki, Josh Merel, Andrew Lefrancq, Timothy P. Lillicrap, and Martin A. Riedmiller. DeepMind Control Suite. CoRR, abs/1801.00690, 2018.
  • [Thomas, 2014] Philip Thomas. Bias in natural actor-critic algorithms. In Proceedings of International Conference on Machine Learning, pages 441–448, 2014.
  • [Todorov et al., 2012] Emanuel Todorov, Tom Erez, and Yuval Tassa. Mujoco: A physics engine for model-based control. In Proceedings of the International Conference on Intelligent Robots and Systems, pages 5026–5033, 2012.