Value-driven Hindsight Modelling

NeurIPS 2020.

Keywords:
task relevant, high dimensional, reinforcement learning, model free, learning using privileged information

Abstract:

Value estimation is a critical component of the reinforcement learning (RL) paradigm. The question of how to effectively learn predictors for value from data is one of the major problems studied by the RL community, and different approaches exploit structure in the problem domain in different ways. Model learning can make use of the rich…

Introduction
  • Consider a baseball player trying to perfect their pitch. The player performs an arm motion and releases the ball towards the batter, but suppose that instead of observing where the ball lands and the reaction of the batter, the player only gets told the result of the play in terms of points or, worse, only gets told the final result of the game.
  • Model-free RL methods directly consider the relation from the observed data X to the outcome of interest Z, focusing solely on predicting and optimising this quantity rather than attempting to learn the full dynamics.
  • These methods have recently dominated the literature, and have attained the best performance in a wide array of complex problems with high-dimensional observations (Mnih et al., 2015; Schulman et al., 2017; Haarnoja et al., 2018; Guez et al., 2019).
  • This usually entails learning an estimate of vπ for the current policy π; this is the problem the authors focus on in this paper (a minimal sketch of such a model-free value estimate follows this list).
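As a point of reference for the model-free baseline discussed above, here is a minimal sketch of estimating vπ from sampled transitions with a tabular TD(0) update. It is purely illustrative: the function name, the tabular setting, and the step size are assumptions, not details from the paper (which uses deep value networks and n-step returns).

```python
import numpy as np

def td0_update(v, s, r, s_next, done, step_size=0.1, gamma=0.99):
    """Single TD(0) update of a tabular estimate of v_pi from one transition s -> s_next."""
    target = r + (0.0 if done else gamma * v[s_next])  # bootstrapped return estimate
    v[s] += step_size * (target - v[s])                # move v(s) toward the target
    return v

# Toy usage: 5 states, one observed transition 2 -> 3 with reward 1 under the current policy.
values = np.zeros(5)
values = td0_update(values, s=2, r=1.0, s_next=3, done=False)
print(values)
```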
Highlights
  • Consider a baseball player trying to perfect their pitch
  • We propose to learn a special value function in hindsight that receives future observations as an additional input
  • We introduce a new value function estimate that can only be computed at training time, the hindsight value function v+
  • We address some of these concerns with specific architectural choices, such as giving v+ only a limited view of future observations and keeping φ low-dimensional.
  • We introduced a reinforcement learning algorithm, Hindsight Modelling (HiMo), that leverages this insight through the following two-stage approach.
  • Task-relevant features φ of future observations are first identified in hindsight while training v+; a forward model is then learned to predict these features, which in turn is used as input to an improved value function that remains usable at test time, yielding better policy evaluation and training (a rough sketch of the two stages follows this list).
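Below is a minimal PyTorch sketch of the two stages. Module names, layer sizes, and the exact placement of stop-gradients are assumptions for illustration (the paper's appendix describes a ConvNet+LSTM torso with dueling heads); the α and β weights mirror Table 3, although which loss each coefficient scales is inferred here rather than quoted.

```python
import torch
import torch.nn as nn

class HiMoSketch(nn.Module):
    """Illustrative HiMo-style value heads; names and sizes are assumptions, not the paper's code."""

    def __init__(self, state_dim, future_dim, d=3, hidden=64):
        super().__init__()
        # phi: low-dimensional (d) features of the future trajectory, learned in hindsight.
        self.phi = nn.Sequential(nn.Linear(future_dim, hidden), nn.ReLU(), nn.Linear(hidden, d))
        # phi_hat: forward model predicting phi from the current state only.
        self.phi_hat = nn.Sequential(nn.Linear(state_dim, hidden), nn.ReLU(), nn.Linear(hidden, d))
        # v_plus: hindsight value; sees the state and the hindsight features (training time only).
        self.v_plus = nn.Sequential(nn.Linear(state_dim + d, hidden), nn.ReLU(), nn.Linear(hidden, 1))
        # v_model: value usable at test time; sees the state and the *predicted* features.
        self.v_model = nn.Sequential(nn.Linear(state_dim + d, hidden), nn.ReLU(), nn.Linear(hidden, 1))

    def loss(self, state, future, value_target, alpha=0.01, beta=1.0):
        phi = self.phi(future)                                   # stage 1: hindsight features
        phi_hat = self.phi_hat(state)                            # stage 2: model of those features
        v_plus = self.v_plus(torch.cat([state, phi], dim=-1))
        # detach() keeps the value loss from training the model through its input, and keeps
        # the model-matching loss from shaping phi (assumed placement; see lead-in above).
        v_model = self.v_model(torch.cat([state, phi_hat.detach()], dim=-1))
        hindsight_loss = (v_plus.squeeze(-1) - value_target).pow(2).mean()
        model_loss = (phi_hat - phi.detach()).pow(2).mean()
        value_loss = (v_model.squeeze(-1) - value_target).pow(2).mean()
        return value_loss + alpha * hindsight_loss + beta * model_loss

# Toy usage with random tensors standing in for a batch of states, k-step future summaries
# and return targets.
net = HiMoSketch(state_dim=8, future_dim=16)
net.loss(torch.randn(32, 8), torch.randn(32, 16), torch.randn(32)).backward()
```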
Methods
  • The illustrative example in Section 3.4 demonstrated the positive effect of hindsight modeling in a simple policy evaluation setting.
  • The authors explore these benefits in the context of policy optimization in two challenging domains: a custom navigation task called Portal Choice, and Atari 2600.
  • If the context matches the goal room color, a reward of 2 is given when the episode terminates; otherwise the terminal reward is 0 (a simplified reading of this rule is sketched below).
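For concreteness, here is the reward rule above as a small hypothetical function; the names and the termination handling are assumptions, not the paper's environment code.

```python
def portal_choice_reward(context_colour: str, goal_room_colour: str, terminated: bool) -> float:
    """Terminal reward of the Portal Choice task as described above (simplified)."""
    if not terminated:
        return 0.0  # no reward until the episode ends
    return 2.0 if context_colour == goal_room_colour else 0.0

assert portal_choice_reward("green", "green", terminated=True) == 2.0
assert portal_choice_reward("green", "red", terminated=True) == 0.0
```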
Results
  • The authors observed an increase of 132.5% in the median human-normalized score compared to the R2D2 baseline with the same network capacity; aggregate results are reported in Table 1 (the normalization behind this metric is sketched below).
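The metric in question is the standard human-normalized Atari score. A minimal sketch of how the aggregate median is typically computed follows; the per-game numbers are placeholders, not results from the paper.

```python
import numpy as np

def human_normalized(agent: float, random: float, human: float) -> float:
    """Standard Atari normalization: 0.0 corresponds to random play, 1.0 to human level."""
    return (agent - random) / (human - random)

# Placeholder (agent, random, human) scores for two fictional games.
scores = {"game_a": (120.0, 10.0, 100.0), "game_b": (5.0, 1.0, 9.0)}
median_score = np.median([human_normalized(*s) for s in scores.values()])
print(f"median human-normalized score: {median_score:.2f}")
```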
Conclusion
  • High-dimensional observations in the intermediate future often contain task-relevant features that can facilitate the prediction of an RL agent’s final return.
  • A forward model is learned to predict these features, which in turn is used as input to an improved value function that remains usable at test time, yielding better policy evaluation and training.
  • The authors demonstrated that this approach can help tame complexity in environments with rich dynamics at scale, yielding increased data efficiency and improving the performance of state-of-the-art model-free architectures.
Tables
  • Table 1: Aggregate Atari results (median human-normalized scores against the R2D2 baseline)
  • Table 2: Hyper-parameter values used for our R2D2 implementation
  • Table 3: Hindsight modelling parameters for Atari: α = 0.01, β = 1.0, k = 5, d = 3
Related work
  • Recent work has used auxiliary predictions successfully in RL as a means to obtain a richer signal for representation learning (Jaderberg et al., 2016; Sutton et al., 2011). However, these additional prediction tasks are hard-coded, so they cannot adapt to the demands of the task when needed. We see them as a complementary approach to more efficient learning in RL.

    Buesing et al. (2018) have considered using observations from an episode trajectory in hindsight to infer variables in a structural causal model of the dynamics, allowing the agent to reason more efficiently, in a model-based way, about counterfactual actions. However, this approach requires learning an accurate generative model of the environment.

    In supervised learning, the learning using privileged information (LUPI) framework introduced by Vapnik & Izmailov (2015) considers ways of leveraging privileged information at training time. Although the techniques developed in that work do not apply directly in the RL setting, our approach can be understood in that framework as treating the future trajectory as privileged information for a value prediction problem.
References
  • Marc G Bellemare, Yavar Naddaf, Joel Veness, and Michael Bowling. The arcade learning environment: An evaluation platform for general agents. Journal of Artificial Intelligence Research, 47: 253–279, 2013.
  • Lars Buesing, Theophane Weber, Yori Zwols, Sebastien Racaniere, Arthur Guez, Jean-Baptiste Lespiau, and Nicolas Heess. Woulda, coulda, shoulda: Counterfactually-guided policy search. arXiv preprint arXiv:1811.06272, 2018.
  • Lasse Espeholt, Hubert Soyer, Remi Munos, Karen Simonyan, Volodymir Mnih, Tom Ward, Yotam Doron, Vlad Firoiu, Tim Harley, Iain Dunning, et al. Impala: Scalable distributed deep-rl with importance weighted actor-learner architectures. arXiv preprint arXiv:1802.01561, 2018.
  • Carles Gelada, Saurabh Kumar, Jacob Buckman, Ofir Nachum, and Marc G Bellemare. DeepMDP: Learning continuous latent space models for representation learning. arXiv preprint arXiv:1906.02736, 2019.
  • Arthur Guez, Mehdi Mirza, Karol Gregor, Rishabh Kabra, Sebastien Racaniere, Theophane Weber, David Raposo, Adam Santoro, Laurent Orseau, Tom Eccles, et al. An investigation of model-free planning. In International Conference on Machine Learning, pp. 2464–2473, 2019.
  • Tuomas Haarnoja, Aurick Zhou, Pieter Abbeel, and Sergey Levine. Soft actor-critic: Offpolicy maximum entropy deep reinforcement learning with a stochastic actor. arXiv preprint arXiv:1801.01290, 2018.
  • Max Jaderberg, Volodymyr Mnih, Wojciech Marian Czarnecki, Tom Schaul, Joel Z Leibo, David Silver, and Koray Kavukcuoglu. Reinforcement learning with unsupervised auxiliary tasks. arXiv preprint arXiv:1611.05397, 2016.
  • Steven Kapturowski, Georg Ostrovski, Will Dabney, John Quan, and Remi Munos. Recurrent experience replay in distributed reinforcement learning. In International Conference on Learning Representations, 2019.
  • Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Andrei A Rusu, Joel Veness, Marc G Bellemare, Alex Graves, Martin Riedmiller, Andreas K Fidjeland, Georg Ostrovski, et al. Human-level control through deep reinforcement learning. Nature, 518(7540):529, 2015.
  • Lerrel Pinto, Marcin Andrychowicz, Peter Welinder, Wojciech Zaremba, and Pieter Abbeel. Asymmetric actor critic for image-based robot learning. arXiv preprint arXiv:1710.06542, 2017.
  • John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347, 2017.
  • Richard S Sutton and Andrew G Barto. Reinforcement learning: An introduction. 2011.
  • Richard S Sutton, Joseph Modayil, Michael Delp, Thomas Degris, Patrick M Pilarski, Adam White, and Doina Precup. Horde: A scalable real-time architecture for learning knowledge from unsupervised sensorimotor interaction. In The 10th International Conference on Autonomous Agents and Multiagent Systems-Volume 2, pp. 761–768, 2011.
  • Erik Talvitie. Model regularization for stable sample rollouts. In Proceedings of the Thirtieth Conference on Uncertainty in Artificial Intelligence, pp. 780–789. AUAI Press, 2014.
  • Hado Van Hasselt, Arthur Guez, and David Silver. Deep reinforcement learning with double Q-learning. In Thirtieth AAAI Conference on Artificial Intelligence, 2016.
  • Vladimir Vapnik and Rauf Izmailov. Learning using privileged information: similarity control and knowledge transfer. Journal of Machine Learning Research, 16:2023–2049, 2015.
  • Yuke Zhu, Ziyu Wang, Josh Merel, Andrei A. Rusu, Tom Erez, Serkan Cabi, Saran Tunyasuvunakool, Janos Kramar, Raia Hadsell, Nando de Freitas, and Nicolas Heess. Reinforcement and imitation learning for diverse visuomotor skills. CoRR, abs/1802.09564, 2018.
Implementation details
  • Hyper-parameters and infrastructure are the same as reported in Kapturowski et al. (2019), with deviations as listed in Table 2. For our value target, we also average different n-step returns with exponential averaging as in Q(λ) (with the return truncated at the end of unrolls). The Q network is composed of a convolutional network (cf. the Vision ConvNet entry in Table 2), followed by an LSTM with 512 hidden units. What we refer to in the main text as the internal state h is the output of the LSTM. The φ and φ̂ networks are MLPs with a single hidden layer of 256 units and a ReLU activation function, followed by a linear layer that outputs a vector of dimension d. The ψθ1 function concatenates h and φ as inputs to an MLP with 256 hidden units and a ReLU activation function, followed by a linear layer that outputs q+ (a vector of dimension 18, the size of the Atari action set). qm is obtained by passing h and φ̂ to a dueling network as described by Kapturowski et al. (2019). A rough sketch of these heads follows.
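Below is a rough PyTorch rendering of the heads described in the paragraph above. It is a sketch under assumptions: the ConvNet+LSTM torso is omitted, the dueling head is taken to be the standard value/advantage decomposition, and the choice of feeding φ a future summary of the same size as h is not pinned down by the excerpt.

```python
import torch
import torch.nn as nn

D, N_ACTIONS, H_LSTM, HIDDEN = 3, 18, 512, 256  # d, Atari action set, LSTM size, MLP width

def mlp(in_dim, out_dim):
    """Single-hidden-layer MLP with ReLU, followed by a linear output layer."""
    return nn.Sequential(nn.Linear(in_dim, HIDDEN), nn.ReLU(), nn.Linear(HIDDEN, out_dim))

class HiMoHeads(nn.Module):
    """Heads on top of the ConvNet+LSTM torso (torso omitted; h is the LSTM output)."""

    def __init__(self):
        super().__init__()
        self.phi = mlp(H_LSTM, D)                    # hindsight features from a future summary
        self.phi_hat = mlp(H_LSTM, D)                # model prediction of phi from h
        self.psi = mlp(H_LSTM + D, N_ACTIONS)        # psi_theta1: outputs q_plus from [h, phi]
        self.value = mlp(H_LSTM + D, 1)              # dueling state-value stream for q_m
        self.advantage = mlp(H_LSTM + D, N_ACTIONS)  # dueling advantage stream for q_m

    def forward(self, h, future_summary):
        phi = self.phi(future_summary)
        phi_hat = self.phi_hat(h)
        q_plus = self.psi(torch.cat([h, phi], dim=-1))
        x = torch.cat([h, phi_hat], dim=-1)
        adv = self.advantage(x)
        q_m = self.value(x) + adv - adv.mean(dim=-1, keepdim=True)  # dueling combination
        return q_plus, q_m, phi, phi_hat

# Toy usage with a batch of 4 dummy LSTM states and future summaries.
heads = HiMoHeads()
q_plus, q_m, phi, phi_hat = heads(torch.randn(4, H_LSTM), torch.randn(4, H_LSTM))
```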