Predictive Information Accelerates Learning in RL

NeurIPS 2020 (2020): 11890–11901

Abstract

The Predictive Information is the mutual information between the past and the future, I(X_past; X_future). We hypothesize that capturing the predictive information is useful in RL, since the ability to model what will happen next is necessary for success on many tasks. To test our hypothesis, we train Soft Actor-Critic (SAC) agents from...
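
For reference, the predictive information named above expands as a standard identity (restated here for clarity; only the notation comes from the paper):

```latex
I(X_{\mathrm{past}}; X_{\mathrm{future}})
  = H(X_{\mathrm{future}}) - H(X_{\mathrm{future}} \mid X_{\mathrm{past}})
  = \mathbb{E}_{p(x_{\mathrm{past}},\, x_{\mathrm{future}})}
      \left[ \log \frac{p(x_{\mathrm{future}} \mid x_{\mathrm{past}})}{p(x_{\mathrm{future}})} \right]
```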
Introduction
  • Many Reinforcement Learning environments have specific dynamics and clear temporal structure: observations of the past allow us to predict what is likely to happen in the future.
  • The environment may be only partially observable, or the state may be represented in very high dimensions, such as an image.
  • In such environments, the task of the agent may be described as finding a representation of the past that is most useful for predicting the future, upon which an optimal policy may more easily be learned.
Highlights
  • Many Reinforcement Learning environments have specific dynamics and clear temporal structure: observations of the past allow us to predict what is likely to happen in the future
  • Sample Efficiency: We demonstrate strong gains in sample efficiency on nine tasks from the DM Control Suite [36] of continuous control tasks, compared to state-of-the-art baselines such as Dreamer [15] and DrQ [25] (Section 4.1)
  • Ablations: Through careful ablations and analysis, we show that the benefit of Predictive Information SAC (PI-SAC) is due substantially to the use of the Predictive Information and compression (Section 4.2)
  • We evaluate Predictive Information Soft Actor-Critic (PI-SAC) on the DeepMind control suite [36] and compare with leading model-free and model-based approaches for continuous control from pixels: SLAC [27], Dreamer [15], and DrQ [25]
  • We presented Predictive Information Soft Actor-Critic (PI-SAC), a continuous control algorithm that trains a SAC agent using an auxiliary objective that learns a compressed representation of the predictive information of the RL environment dynamics (a schematic sketch of such an objective follows this list)
  • We showed with extensive experiments that learning a compressed predictive information representation can substantially improve sample efficiency and training stability at no cost to final agent performance
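
The highlights above describe the auxiliary objective only at a high level. As a rough illustration (not the authors' implementation), the sketch below shows one common way to write a "compressed predictive information" loss: an InfoNCE-style contrastive term that predicts the future from a latent of the past, plus a β-weighted compression term (β as in the ablation figure under Related Work). The encoder outputs, the Gaussian parameterization, and the name predictive_info_loss are assumptions made for this sketch.

```python
# Schematic sketch only, not the paper's code. A forward encoder maps past frames
# and actions to a latent distribution e(z | past); a backward encoder maps future
# frames to b(z | future). A contrastive term ties each sampled z to its own future
# within the batch; a beta-weighted term compresses away past information that the
# future-conditioned encoder cannot explain.
import tensorflow as tf
import tensorflow_probability as tfp

def predictive_info_loss(past_latent_mean, future_latent_mean, beta=0.1, scale=1.0):
    """past_latent_mean:   [B, D] mean of e(z | past frames, actions)  (assumed).
    future_latent_mean: [B, D] mean of b(z | future frames)            (assumed)."""
    e_dist = tfp.distributions.Normal(past_latent_mean, scale)    # e(z | past)
    b_dist = tfp.distributions.Normal(future_latent_mean, scale)  # b(z | future)
    z = e_dist.sample()                                            # [B, D]

    # Contrastive prediction: logits[i, j] = log b(z_i | future_j); the matching
    # future (j == i) should score highest (an InfoNCE-style lower bound).
    logits = tf.reduce_sum(b_dist.log_prob(z[:, None, :]), axis=-1)  # [B, B]
    labels = tf.range(tf.shape(z)[0])
    prediction_loss = tf.reduce_mean(
        tf.nn.sparse_softmax_cross_entropy_with_logits(labels=labels, logits=logits))

    # Compression: penalize information in z about the past that the future does
    # not account for (a Monte Carlo estimate of log e(z|past) - log b(z|future)).
    compression = tf.reduce_mean(e_dist.log_prob(z) - b_dist.log_prob(z))

    return prediction_loss + beta * compression
```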
Methods
  • The authors evaluate PI-SAC on the DeepMind control suite [36] and compare with leading model-free and model-based approaches for continuous control from pixels: SLAC [27], Dreamer [15], and DrQ [25].
  • The authors' benchmark includes the six tasks from the PlaNet benchmark [16] and three additional tasks: Cartpole Balance Sparse, Hopper Stand, and Walker Stand.
  • On each PlaNet task, the authors evaluate PI-SAC with the action repeat at which SLAC performs the best, and compare with the best DrQ result.
  • On Walker Walk, Cartpole Balance Sparse, Hopper Stand, and Walker Stand, the authors evaluate PI-SAC with action repeat 2 and directly compare with Dreamer and DrQ results on the Dreamer benchmark (a generic action-repeat wrapper is sketched below).
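
Every benchmark configuration above is stated in terms of action repeat. As a generic illustration of what that means in practice (the reset()/step() interface and the class name ActionRepeat are assumptions, not the paper's code):

```python
class ActionRepeat:
    """Generic action-repeat wrapper sketch: each agent action is applied `repeat`
    times to the underlying environment and the intermediate rewards are summed."""

    def __init__(self, env, repeat=2):
        self.env = env
        self.repeat = repeat

    def reset(self):
        return self.env.reset()

    def step(self, action):
        total_reward, obs, done, info = 0.0, None, False, {}
        for _ in range(self.repeat):
            obs, reward, done, info = self.env.step(action)
            total_reward += reward
            if done:  # stop early if the episode ends mid-repeat
                break
        return obs, total_reward, done, info
```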
Results
  • Evaluation Setups

    The authors evaluate the agent at every evaluation point by computing the average episode return over 10 evaluation episodes (this protocol is sketched in the code after this list).
  • For most of the experiments, the authors evaluate every 2500 environment steps after applying action repeat for Cheetah, Walker, and Hopper tasks.
  • For Ball in Cup, Cartpole, Finger, and Reacher tasks, the authors evaluate every 1000 environment steps after applying action repeat.
  • SAC Implementation.

    The authors' SAC implementation is based on TF-Agents [11] and follows the standard SAC implementation [14]. Its performance and sample efficiency match the benchmark results reported in [14].
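
The evaluation protocol above (average episode return over 10 evaluation episodes at each evaluation point) can be sketched as follows; the policy.act / env.reset / env.step interface is an illustrative assumption, not the TF-Agents API:

```python
def evaluate(policy, env, num_episodes=10):
    """Average episode return over `num_episodes` evaluation episodes, matching
    the protocol described above (interfaces are assumed for illustration)."""
    returns = []
    for _ in range(num_episodes):
        obs = env.reset()
        done, episode_return = False, 0.0
        while not done:
            action = policy.act(obs)
            obs, reward, done, _ = env.step(action)
            episode_return += reward
        returns.append(episode_return)
    return sum(returns) / num_episodes
```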
Conclusion
  • The authors presented Predictive Information Soft Actor-Critic (PI-SAC), a continuous control algorithm that trains a SAC agent using an auxiliary objective that learns a compressed representation of the predictive information of the RL environment dynamics.
  • The authors showed with extensive experiments that learning a compressed predictive information representation can substantially improve sample efficiency and training stability at no cost to final agent performance.
  • The authors gave preliminary indications that compressed representations can generalize better than uncompressed representations at task transfer.
  • Future work will explore variations of the PI-SAC architecture, such as using RNNs for environments that require long-term planning
Tables
  • Table 1: Left: Global PI-SAC hyperparameters. Right: Per-task PI-SAC hyperparameters. PlaNet tasks are indicated with (P)
Related work
  • Future Prediction in RL. Future prediction is commonly used in reinforcement learning in a few different ways. Model-based RL algorithms build world model(s) to predict the future conditioned on past observations and actions, and then find the policy through planning [16, 5, 15, 23, 12, 21, 35]. On the other hand, as we study in this work, it is often used as an auxiliary or representation learning method for model-free RL agents [28, 30, 12, 20, 32, 1, 8]. We hypothesize that the success of these methods comes from the predictive information they capture. In contrast to prior work, our approach directly measures and compresses the predictive information, so that the representation avoids capturing the large amount of information in the past that is irrelevant to the future. As described in Section 2, the predictive information that we consider captures environment dynamics. This is different from some other approaches [28, 1] that use a contrastive mutual information estimator (e.g. InfoNCE) to capture temporal coherence of observations rather than environment dynamics (since their predictions are not action-conditioned), and thus they have limitations in off-policy learning.

    [Figure: task-transfer learning curves. Panels: SRC: Cartpole Balance (No Aug) → TGT: Cartpole Swingup (No Aug) with β ∈ {1.0, 0.1, 0.01, 0.001, 0.0}; SRC: Walker Stand → TGT: Walker Walk; SRC: Walker Walk → TGT: Walker Stand. Caption fragment: "Cartpole experiment to amplify the benefit of compression. All curves show 5 runs."]
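
For contrast with the action-conditioned objective sketched under Highlights, a plain InfoNCE estimator over embeddings of consecutive observations (the temporal-coherence style of objective referred to above) might look like the following; the bilinear critic and all names are illustrative assumptions:

```python
import tensorflow as tf

def info_nce_loss(z_t, z_tp1, W):
    """Minimal InfoNCE sketch: z_t and z_tp1 are [B, D] embeddings of observations
    at consecutive timesteps, W is a [D, D] trainable critic matrix (assumed).
    Each z_t is scored against every z_tp1 in the batch; the temporally matching
    pair is the positive. Note that no actions are involved."""
    logits = tf.matmul(tf.matmul(z_t, W), z_tp1, transpose_b=True)  # [B, B]
    labels = tf.range(tf.shape(z_t)[0])
    return tf.reduce_mean(
        tf.nn.sparse_softmax_cross_entropy_with_logits(labels=labels, logits=logits))
```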
References
  • Ankesh Anand, Evan Racah, Sherjil Ozair, Yoshua Bengio, Marc-Alexandre Côté, and R Devon Hjelm. Unsupervised state representation learning in Atari. In Advances in Neural Information Processing Systems, pages 8766–8779, 2019.
  • Jimmy Lei Ba, Jamie Ryan Kiros, and Geoffrey E Hinton. Layer normalization. arXiv preprint arXiv:1607.06450, 2016.
  • Raef Bassily, Shay Moran, Ido Nachum, Jonathan Shafer, and Amir Yehudayoff. Learners that use little information. In Algorithmic Learning Theory, pages 25–55, 2018.
  • William Bialek and Naftali Tishby. Predictive information. arXiv preprint cond-mat/9902341, 1999.
  • Kurtland Chua, Roberto Calandra, Rowan McAllister, and Sergey Levine. Deep reinforcement learning in a handful of trials using probabilistic dynamics models. In Advances in Neural Information Processing Systems, pages 4754–4765, 2018.
  • Ian Fischer. The conditional entropy bottleneck. arXiv preprint arXiv:2002.05379, 2020.
  • Ian Fischer and Alexander A Alemi. CEB improves model robustness. arXiv preprint arXiv:2002.05380, 2020.
  • Carles Gelada, Saurabh Kumar, Jacob Buckman, Ofir Nachum, and Marc G Bellemare. DeepMDP: Learning continuous latent space models for representation learning. In International Conference on Machine Learning, pages 2170–2179, 2019.
  • Anirudh Goyal, Yoshua Bengio, Matthew Botvinick, and Sergey Levine. The variational bandwidth bottleneck: Stochastic evaluation on an information budget. In International Conference on Learning Representations, 2020.
  • Anirudh Goyal, Riashat Islam, Daniel Strouse, Zafarali Ahmed, Matthew Botvinick, Hugo Larochelle, Yoshua Bengio, and Sergey Levine. Infobot: Transfer and exploration via the information bottleneck. In International Conference on Learning Representations, 2019.
  • Sergio Guadarrama, Anoop Korattikara, Oscar Ramirez, Pablo Castro, Ethan Holly, Sam Fishman, Ke Wang, Ekaterina Gonina, Neal Wu, Efi Kokiopoulou, Luciano Sbaiz, Jamie Smith, Gábor Bartók, Jesse Berent, Chris Harris, Vincent Vanhoucke, and Eugene Brevdo. TF-Agents: A library for reinforcement learning in tensorflow. https://github.com/tensorflow/agents, 2018. [Online; accessed 25-June-2019].
  • David Ha and Jürgen Schmidhuber. World models. arXiv preprint arXiv:1803.10122, 2018.
  • Tuomas Haarnoja, Haoran Tang, Pieter Abbeel, and Sergey Levine. Reinforcement learning with deep energy-based policies. In International Conference on Machine Learning, pages 1352–1361, 2017.
  • Tuomas Haarnoja, Aurick Zhou, Kristian Hartikainen, George Tucker, Sehoon Ha, Jie Tan, Vikash Kumar, Henry Zhu, Abhishek Gupta, Pieter Abbeel, et al. Soft actor-critic algorithms and applications. arXiv preprint arXiv:1812.05905, 2018.
  • Danijar Hafner, Timothy Lillicrap, Jimmy Ba, and Mohammad Norouzi. Dream to control: Learning behaviors by latent imagination. In International Conference on Learning Representations, 2019.
  • Danijar Hafner, Timothy Lillicrap, Ian Fischer, Ruben Villegas, David Ha, Honglak Lee, and James Davidson. Learning latent dynamics for planning from pixels. In International Conference on Machine Learning, pages 2555–2565, 2019.
  • R Devon Hjelm, Alex Fedorov, Samuel Lavoie-Marchildon, Karan Grewal, Phil Bachman, Adam Trischler, and Yoshua Bengio. Learning deep representations by mutual information estimation and maximization. In International Conference on Learning Representations, 2019.
  • Maximilian Igl, Kamil Ciosek, Yingzhen Li, Sebastian Tschiatschek, Cheng Zhang, Sam Devlin, and Katja Hofmann. Generalization in reinforcement learning with selective noise injection and information bottleneck. In Advances in Neural Information Processing Systems, pages 13978–13990, 2019.
  • Sergey Ioffe and Christian Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. In International Conference on Machine Learning, pages 448–456, 2015.
  • Max Jaderberg, Volodymyr Mnih, Wojciech Marian Czarnecki, Tom Schaul, Joel Z Leibo, David Silver, and Koray Kavukcuoglu. Reinforcement learning with unsupervised auxiliary tasks. In International Conference on Learning Representations, 2017.
  • Michael Janner, Justin Fu, Marvin Zhang, and Sergey Levine. When to trust your model: Model-based policy optimization. In Advances in Neural Information Processing Systems, pages 12498–12509, 2019.
  • Edwin T Jaynes. Information theory and statistical mechanics. Physical review, 106(4):620, 1957.
  • Lukasz Kaiser, Mohammad Babaeizadeh, Piotr Milos, Blazej Osinski, Roy H Campbell, Konrad Czechowski, Dumitru Erhan, Chelsea Finn, Piotr Kozakowski, Sergey Levine, et al. Model-based reinforcement learning for Atari. In International Conference on Learning Representations, 2020.
  • Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. In International Conference on Learning Representations, 2015.
  • Ilya Kostrikov, Denis Yarats, and Rob Fergus. Image augmentation is all you need: Regularizing deep reinforcement learning from pixels. arXiv preprint arXiv:2004.13649, 2020.
  • Michael Laskin, Kimin Lee, Adam Stooke, Lerrel Pinto, Pieter Abbeel, and Aravind Srinivas. Reinforcement learning with augmented data. arXiv preprint arXiv:2004.14990, 2020.
  • Alex X Lee, Anusha Nagabandi, Pieter Abbeel, and Sergey Levine. Stochastic latent actor-critic: Deep reinforcement learning with a latent variable model. arXiv preprint arXiv:1907.00953, 2019.
  • Aaron van den Oord, Yazhe Li, and Oriol Vinyals. Representation learning with contrastive predictive coding. arXiv preprint arXiv:1807.03748, 2018.
  • Ben Poole, Sherjil Ozair, Aaron Van Den Oord, Alex Alemi, and George Tucker. On variational bounds of mutual information. In International Conference on Machine Learning, pages 5171–5180, 2019.
  • Jürgen Schmidhuber. Making the world differentiable: On using self-supervised fully recurrent neural networks for dynamic reinforcement learning and planning in non-stationary environments. Technical report, Institut für Informatik, Technische Universität München, 1990.
  • Ohad Shamir, Sivan Sabato, and Naftali Tishby. Learning and generalization with the information bottleneck. Theoretical Computer Science, 411(29-30):2696–2711, 2010.
  • Evan Shelhamer, Parsa Mahmoudieh, Max Argus, and Trevor Darrell. Loss is its own reward: Self-supervision for reinforcement learning. arXiv preprint arXiv:1612.07307, 2016.
  • Saurabh Singh and Shankar Krishnan. Filter response normalization layer: Eliminating batch dependence in the training of deep neural networks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 11237–11246, 2020.
  • Aravind Srinivas, Michael Laskin, and Pieter Abbeel. CURL: Contrastive unsupervised representations for reinforcement learning. arXiv preprint arXiv:2004.04136, 2020.
  • Richard S Sutton. Dyna, an integrated architecture for learning, planning, and reacting. ACM Sigart Bulletin, 2(4):160–163, 1991.
  • Yuval Tassa, Yotam Doron, Alistair Muldal, Tom Erez, Yazhe Li, Diego de Las Casas, David Budden, Abbas Abdolmaleki, Josh Merel, Andrew Lefrancq, et al. DeepMind control suite. arXiv preprint arXiv:1801.00690, 2018.
  • Denis Yarats, Amy Zhang, Ilya Kostrikov, Brandon Amos, Joelle Pineau, and Rob Fergus. Improving sample efficiency in model-free reinforcement learning from images. arXiv preprint arXiv:1910.01741, 2019.
  • Brian D Ziebart. Modeling Purposeful Adaptive Behavior with the Principle of Maximum Causal Entropy. PhD thesis, Carnegie Mellon University, 2010.
  • Brian D Ziebart, Andrew L Maas, J Andrew Bagnell, and Anind K Dey. Maximum entropy inverse reinforcement learning. In AAAI, volume 8, pages 1433–1438. Chicago, IL, USA, 2008.