Reinforcement Learning with Competitive Ensembles of Information-Constrained Primitives

    CoRR, 2019.

    Keywords:
    Reinforcement Learning, Variational Information Bottleneck, Learning Primitives

    Abstract:

    Reinforcement learning agents that operate in diverse and complex environments can benefit from the structured decomposition of their behavior. Often, this is addressed in the context of hierarchical reinforcement learning, where the aim is to decompose a policy into lower-level primitives or options, and a higher-level meta-policy that t…

    Highlights
    • Learning policies that generalize to new environments or tasks is a fundamental challenge in reinforcement learning
    • We empirically show that, to achieve better generalization, the interaction between the low-level primitives and the selection among them should itself be performed without a single centralized network that understands the entire state space
    • We compare our proposed method to the following baselines: a) Option-Critic (Bacon et al., 2017) – we extended the authors’ implementation of the Option-Critic architecture and experimented with multiple variations of hyperparameters and state/goal encodings
    • We present a framework for learning an ensemble of primitive policies that can collectively solve tasks without learning an explicit master policy
    • On Minigrid, we show how primitives trained with our method can transfer much more successfully to new tasks
    • On the Ant Maze, we show that primitives initialized from a pretrained walking controller can learn to walk to different goals in a stochastic, multi-modal environment with nearly twice the success rate of a more conventional hierarchical RL approach that uses the same pretraining but a centralized high-level policy
    Summary
    • Learning policies that generalize to new environments or tasks is a fundamental challenge in reinforcement learning.
    • We frame the problem as one of information transfer between the current state and a dynamically selected primitive policy.
    • Constraining the amount of accessible information in this way naturally leads to a decentralized competition and decision mechanism where individual primitives specialize in smaller regions of the state space.
    • In summary, the contributions of our work are as follows: (1) we propose a method for learning and operating a set of functional primitives in a decentralized way, without requiring an explicit high-level meta-controller to select the active primitives.
    • (2) We introduce an information-theoretic objective, the effects of which are twofold: a) it leads to the specialization of individual primitives to distinct regions of the state space, and b) it enables a competition mechanism, which is used to select active primitives in a decentralized manner.
    • The information bottleneck and the competition mechanism, when combined with the overall reward maximization objective, will lead to specialization of individual primitives to distinct regions in the state space.
    • Here E_{π_θ} denotes an expectation over the state trajectories generated by the agent’s policy, r_k = α_k · r is the reward given to the k-th primitive, and β_ind and β_reg are coefficients controlling the impact of the respective terms (see the sketch after this list).
    • The key difference between our approach and all the works mentioned above is that we learn functional primitives without requiring any explicit high-level meta-controller or master policy.
    • We evaluate our approach on a number of RL environments to demonstrate that we can learn sets of primitive policies focusing on different aspects of a task and collectively solving it.
    • To evaluate the proposed method in terms of scalability, we present a series of tasks from the motion imitation domain, showing that we can use a set of distinct primitives for imitation learning.
    • Rather than relying on a centralized, learned meta-controller, the selection of active primitives is implemented through an information-theoretic mechanism.
    • On a partially observed “Minigrid” task and a continuous control “Ant Maze” walking task, our method can enable better transfer than flat policies and hierarchical RL baselines, including the Meta-learning Shared Hierarchies model and the Option-Critic framework.
    • On the Ant Maze, we show that primitives initialized from a pretrained walking controller can learn to walk to different goals in a stochastic, multi-modal environment with nearly twice the success rate of a more conventional hierarchical RL approach that uses the same pretraining but a centralized high-level policy.
    • The already learned primitives would thereby keep their focus on particular aspects of the task, while newly added ones could specialize in novel aspects.
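    The decentralized competition described above admits a compact sketch. The Python snippet below is our own illustration, not the authors’ code: names such as GaussianPrimitive and competitive_primitive_step, and all dimensions, are made up. It shows one plausible reading of the mechanism: each primitive encodes the state through its own variational bottleneck and pays an information cost L_k (a KL term), the selection weights α_k are a normalized function of those costs, the executed action is the α-weighted mixture of the primitives’ proposals, and the reward is later split as r_k = α_k · r while the costs are penalized via the β_ind / β_reg terms.

```python
import numpy as np

class GaussianPrimitive:
    """Toy linear encoder/decoder standing in for one primitive (illustrative only)."""
    def __init__(self, state_dim, z_dim, action_dim, rng):
        self.W_enc = rng.normal(scale=0.1, size=(2 * z_dim, state_dim))
        self.W_dec = rng.normal(scale=0.1, size=(action_dim, z_dim))
        self.z_dim = z_dim

    def encode(self, state):
        h = self.W_enc @ state
        return h[: self.z_dim], h[self.z_dim:]          # mu, log_sigma of q_k(z | s)

    def decode(self, z):
        return self.W_dec @ z                            # primitive's proposed action


def competitive_primitive_step(state, primitives, rng):
    """One decision step of the decentralized competition (a sketch, not the authors' code).

    Each primitive pays an information cost L_k = KL(q_k(z | s) || N(0, I)).
    Selection weights alpha_k here are a softmax over these costs, so a primitive
    only wins control of a state if it chooses to read information about it; the
    reward is later split as r_k = alpha_k * r, and the costs are penalized in
    the objective through the beta_ind / beta_reg terms.
    """
    costs, proposals = [], []
    for prim in primitives:
        mu, log_sigma = prim.encode(state)
        # KL(N(mu, sigma^2) || N(0, I)) -- the per-primitive information cost L_k
        kl = 0.5 * np.sum(np.exp(2 * log_sigma) + mu ** 2 - 1.0 - 2 * log_sigma)
        z = mu + np.exp(log_sigma) * rng.standard_normal(mu.shape)   # reparameterized sample
        costs.append(kl)
        proposals.append(prim.decode(z))
    costs = np.array(costs)
    alpha = np.exp(costs - costs.max())
    alpha /= alpha.sum()                                   # competition weights alpha_k
    action = sum(a * p for a, p in zip(alpha, proposals))  # alpha-weighted action
    return action, alpha, costs


# Minimal usage example on a random state (all dimensions are arbitrary).
rng = np.random.default_rng(0)
primitives = [GaussianPrimitive(state_dim=8, z_dim=4, action_dim=2, rng=rng) for _ in range(3)]
action, alpha, costs = competitive_primitive_step(rng.standard_normal(8), primitives, rng)
print("weights:", alpha, "info costs:", costs)
```

    The design point the sketch tries to make concrete is that no module ever sees all primitives’ internals: the only shared quantity is the vector of information costs, which is enough to run the competition without a centralized meta-controller.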
    Tables
    • Table 1: Hyperparameters
    Related work
    • There is a wide variety of hierarchical reinforcement learning approaches (Sutton et al., 1998; Dayan & Hinton, 1993; Dietterich, 2000). One of the most widely applied HRL frameworks is the Options framework (Sutton et al., 1999b). An option can be thought of as an action that extends over multiple timesteps, providing a notion of temporal abstraction, or subroutines, in an MDP. Each option has its own policy (which is followed while the option is active) and a termination condition (which stops the execution of that option). Many strategies have been proposed for discovering options using task-specific hierarchies, such as pre-defined sub-goals (Heess et al., 2017), hand-designed features (Florensa et al., 2017), or diversity-promoting priors (Daniel et al., 2012; Eysenbach et al., 2018); these approaches do not generalize well to new tasks. Bacon et al. (2017) proposed learning options in an end-to-end manner by parameterizing, for all options, the intra-option policies and termination conditions along with the policy over options. Eigen-options (Machado et al., 2017) use the eigenvectors of the Laplacian of the transition graph induced by the MDP to derive intrinsic rewards for discovering options as well as learning intra-option policies.
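    For readers unfamiliar with the option abstraction contrasted against in this work, here is a minimal, illustrative sketch in Python. It assumes a gym-style env.step API and invented names (Option, run_option); it is not taken from any of the cited codebases. It only shows the two ingredients mentioned above, an intra-option policy and a termination condition, executed as a temporally extended action.

```python
from dataclasses import dataclass
from typing import Any, Callable

@dataclass
class Option:
    """Illustrative option: a temporally extended action."""
    intra_policy: Callable[[Any], Any]    # pi_o(a | s): action taken while the option is active
    termination: Callable[[Any], float]   # beta_o(s): probability of terminating in state s

def run_option(env, state, option, rng, gamma=0.99, max_steps=100):
    """Roll out one option until its termination condition fires (SMDP-style execution).

    A hierarchical agent would have a higher-level policy pick which option to run
    next; the approach summarized above instead removes that centralized selector.
    """
    total_reward, discount = 0.0, 1.0
    for _ in range(max_steps):
        action = option.intra_policy(state)
        state, reward, done, _ = env.step(action)   # gym-style step (assumed interface)
        total_reward += discount * reward
        discount *= gamma
        if done or rng.random() < option.termination(state):
            break
    return state, total_reward
```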
    Funding
    • The authors are grateful to NSERC, CIFAR, Google, Samsung, Nuance, IBM, Canada Research Chairs, the Canada Graduate Scholarship Program, and Nvidia for funding, and to Compute Canada for computing resources.