Towards Effective Context for Meta-Reinforcement Learning: an Approach based on Contrastive Learning

Haotian Fu
Chen Chen
Xidong Feng
Dong Li
Wulong Liu
Keywords:
informative trajectory, effective context, reinforcement learning, contrastive learning, neural network

Abstract:

Context, the embedding of previously collected trajectories, is a powerful construct for Meta-Reinforcement Learning (Meta-RL) algorithms. By conditioning on an effective context, Meta-RL policies can easily generalize to new tasks within a few adaptation steps. We argue that improving the quality of context involves answering two questions: 1) How to collect informative trajectories whose corresponding context reflects the specification of the task? 2) How to train a compact and sufficient encoder that can embed the task-specific information contained in prior trajectories? Our method, CCM, tackles these two problems respectively: it focuses on the underlying structure behind different tasks' transitions and trains the context encoder with contrastive learning, and it further learns a separate exploration agent with an information-theoretical objective that aims to maximize the improvement of inference after collecting new transitions. Empirical results on several complex simulated control tasks show that CCM outperforms state-of-the-art Meta-RL methods.

Introduction
  • Reinforcement Learning (RL) combined with deep neural networks has achieved impressive results on various complex tasks (Mnih et al 2015; Lillicrap et al 2016; Schulman et al 2015).
  • Given a number of tasks with similar structures, Meta-RL methods aim to capture the common knowledge shared across tasks from previous experience on training tasks and adapt to a new task with only a small number of interactions
  • Based on this idea, many Meta-RL methods try to learn a general model initialization and update the parameters during adaptation (Finn, Abbeel, and Levine 2017; Rothfuss et al 2019).
  • Context-based Meta-RL methods train a policy conditioned on the latent context to improve generalization
Highlights
  • Reinforcement Learning (RL) combined with deep neural networks has achieved impressive results on various complex tasks (Mnih et al 2015; Lillicrap et al 2016; Schulman et al 2015)
  • Context-based Meta-Reinforcement Learning (Meta-RL) methods train a policy conditioned on the latent context to improve generalization
  • We first evaluate the performance of context-based Meta-RL methods after being combined with our contrastive context encoder on several continuous control tasks simulated via the MuJoCo physics simulator (Todorov, Erez, and Tassa 2012), which are standard Meta-RL benchmarks used in prior work (Fakoor et al 2020; Rakelly et al 2019)
  • We propose that constructing a powerful context for Meta-RL involves two problems: 1) How to collect informative trajectories whose corresponding context reflects the specification of the task? 2) How to train a compact and sufficient encoder that can embed the task-specific information contained in prior trajectories? We propose our method CCM, which tackles these two problems respectively
  • CCM further learns a separate exploration agent with an information-theoretical objective that aims to maximize the improvement of inference after collecting new transitions (a rough sketch of this objective is given after this list)
  • The empirical results on several complex simulated control tasks show that CCM outperforms state-of-the-art Meta-RL methods by addressing the aforementioned problems
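To make the exploration objective concrete, the following is a minimal sketch (not the authors' implementation) of one simple way to reward "improvement of inference": assume a hypothetical context encoder that returns a diagonal-Gaussian belief over the latent task variable, and use the reduction in belief entropy after appending a newly collected transition as an intrinsic reward. The paper's information-theoretical objective may differ in form; this is only an illustrative proxy.

    import torch

    def exploration_reward(encoder, context, new_transition):
        """Hypothetical intrinsic reward: how much a newly collected transition
        sharpens the task belief. `encoder` is assumed to map a batch of
        transitions [N, d] to the (mean, std) of a diagonal-Gaussian posterior
        over the latent task variable z."""
        with torch.no_grad():
            _, std_before = encoder(context)
            _, std_after = encoder(torch.cat([context, new_transition], dim=0))
        # The entropy of a diagonal Gaussian is a constant plus the sum of
        # log-stds, so this difference is the reduction in belief entropy,
        # i.e. how much was learned about the task from the new transition.
        return (std_before.log().sum() - std_after.log().sum()).item()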
Conclusion
  • The authors propose that constructing a powerful context for Meta-RL involves two problems: 1) How to collect informative trajectories whose corresponding context reflects the specification of the task? 2) How to train a compact and sufficient encoder that can embed the task-specific information contained in prior trajectories? The authors propose the method CCM, which tackles these two problems respectively.
  • CCM focuses on the underlying structure behind different tasks’ transitions and trains the encoder by leveraging contrastive learning (see the sketch after this list).
  • CCM further learns a separate exploration agent with an information-theoretical objective that aims to maximize the improvement of inference after collecting new transitions.
  • The empirical results on several complex simulated control tasks show that CCM outperforms state-of-the-art Meta-RL methods by addressing the aforementioned problems
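As a rough illustration of the contrastive encoder training mentioned above, the sketch below (an assumption-laden simplification, not the authors' code) treats transition batches sampled from the same task as positive pairs and batches from the other tasks in the meta-batch as negatives, and scores them with an InfoNCE-style loss. A momentum key encoder and a queue of negatives (in the style of MoCo/CURL) could be substituted for the single shared encoder used here.

    import torch
    import torch.nn.functional as F

    def contrastive_context_loss(encoder, queries, keys, temperature=0.1):
        """Illustrative InfoNCE-style objective for a context encoder.
        queries, keys: [num_tasks, num_transitions, transition_dim], where
        row i of both tensors holds transitions collected from task i.
        Row i of `queries` should embed close to row i of `keys` (same task)
        and far from every other row (different tasks)."""
        q = F.normalize(encoder(queries), dim=-1)   # [num_tasks, latent_dim]
        k = F.normalize(encoder(keys), dim=-1)      # [num_tasks, latent_dim]
        logits = q @ k.t() / temperature            # task-by-task similarity
        labels = torch.arange(q.shape[0], device=q.device)  # positives on the diagonal
        return F.cross_entropy(logits, labels)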
Summary
  • Objectives:

    The authors aim to train a compact and sufficient encoder through extracting mid-level task-specific features.
  • Methods:

    The authors first evaluate the performance of context-based Meta-RL methods after being combined with the contrastive context encoder on several continuous control tasks simulated via the MuJoCo physics simulator (Todorov, Erez, and Tassa 2012), which are standard Meta-RL benchmarks used in prior work (Fakoor et al 2020; Rakelly et al 2019).
  • They also compare against DP (Dynamics Prediction) (Lee et al 2020; Zhou, Pinto, and Gupta 2019), in which the encoder is instead trained by performing forward or backward prediction; a rough sketch of this alternative follows below.
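For contrast with the contrastive objective, here is a minimal sketch of what such a dynamics-prediction objective could look like (hypothetical encoder and forward-model modules, not the baseline's actual code): the context embedding is trained so that, together with the current state and action, it predicts the next state.

    import torch.nn.functional as F

    def dynamics_prediction_loss(encoder, forward_model, context, state, action, next_state):
        """Illustrative DP objective: the context embedding z must carry enough
        task information for a learned forward model to predict the next state
        from (state, action, z). Gradients train both modules jointly."""
        z = encoder(context)                          # task embedding from prior transitions
        predicted_next = forward_model(state, action, z)
        return F.mse_loss(predicted_next, next_state)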
Tables
  • Table 1: CCM’s hyperparameters for sparse-reward environments
Related work
  • Contrastive Learning. Contrastive learning has recently achieved great success in learning representations for image or sequential data (van den Oord, Li, and Vinyals 2018; Henaff et al 2019; He et al 2020; Chen et al 2020). In RL, it has been used to extract reward signals in latent space (Sermanet et al 2018; Dwibedi et al 2018), or as an auxiliary task for learning representations of high-dimensional data (Srinivas, Laskin, and Abbeel 2020; Anand et al 2019). Contrastive learning learns representations that obey similarity constraints by dividing the dataset into similar (positive) and dissimilar (negative) pairs and minimizing a contrastive loss. Prior work (Srinivas, Laskin, and Abbeel 2020; Henaff et al 2019) has shown various ways of generating positive and negative pairs for image-based input data; the standard approach is to create multiple views of each datapoint, e.g., through random crops and other data augmentations (Wu et al 2018; Chen et al 2020; He et al 2020). In this work, however, we focus on low-dimensional input data and leverage the natural discrimination between trajectories of different tasks to generate positive and negative data. Many contrastive loss functions have been proposed; the most competitive one is InfoNCE (van den Oord, Li, and Vinyals 2018), written out below. The motivation behind the contrastive loss is the InfoMax principle (Linsker 1988), which can be interpreted as maximizing the mutual information between two views of the data. The relationship between the InfoNCE loss and mutual information is comprehensively explained in (Poole et al 2019).
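For reference, a standard form of the InfoNCE loss (following van den Oord, Li, and Vinyals 2018; the exact variant used in a given method may differ), for an anchor x, its positive x^+, negatives x_1^-, ..., x_{N-1}^-, and a learned similarity score f:

    \mathcal{L}_{\mathrm{InfoNCE}}
      = -\,\mathbb{E}\left[
          \log \frac{\exp f(x, x^{+})}
                    {\exp f(x, x^{+}) + \sum_{j=1}^{N-1} \exp f(x, x_{j}^{-})}
        \right]

Minimizing this loss maximizes a lower bound on the mutual information between the two views, I(x; x^+) >= log N - L_InfoNCE, which is the formal statement of the InfoMax intuition above.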
References
  • Anand, A.; Racah, E.; Ozair, S.; Bengio, Y.; Cote, M.; and Hjelm, R. D. 2019. Unsupervised State Representation Learning in Atari. In Advances in Neural Information Processing Systems 32: Annual Conference on Neural Information Processing Systems 2019, NeurIPS 2019, 8766–8779.
  • Brockman, G.; Cheung, V.; Pettersson, L.; Schneider, J.; Schulman, J.; Tang, J.; and Zaremba, W. 2016. OpenAI Gym. CoRR abs/1606.01540.
  • Chen, T.; Kornblith, S.; Norouzi, M.; and Hinton, G. E. 2020. A Simple Framework for Contrastive Learning of Visual Representations. CoRR abs/2002.05709.
  • Duan, Y.; Schulman, J.; Chen, X.; Bartlett, P. L.; Sutskever, I.; and Abbeel, P. 2016. RL^2: Fast Reinforcement Learning via Slow Reinforcement Learning. CoRR abs/1611.02779.
  • Dwibedi, D.; Tompson, J.; Lynch, C.; and Sermanet, P. 2018. Learning Actionable Representations from Visual Observations. In 2018 IEEE/RSJ International Conference on Intelligent Robots and Systems, IROS 2018, 1577–1584. IEEE.
  • Fakoor, R.; Chaudhari, P.; Soatto, S.; and Smola, A. J. 2020. Meta-Q-Learning. In 8th International Conference on Learning Representations, ICLR 2020. OpenReview.net.
  • Finn, C.; Abbeel, P.; and Levine, S. 2017. Model-Agnostic Meta-Learning for Fast Adaptation of Deep Networks. In Proceedings of the 34th International Conference on Machine Learning, ICML 2017, 1126–1135.
  • Fu, H.; Tang, H.; Hao, J.; Liu, W.; and Chen, C. 2019. MGHRL: Meta Goal-generation for Hierarchical Reinforcement Learning. ArXiv abs/1909.13607.
  • Gupta, A.; Mendonca, R.; Liu, Y.; Abbeel, P.; and Levine, S. 2018. Meta-Reinforcement Learning of Structured Exploration Strategies. In NeurIPS.
  • Haarnoja, T.; Zhou, A.; Abbeel, P.; and Levine, S. 2018. Soft Actor-Critic: Off-Policy Maximum Entropy Deep Reinforcement Learning with a Stochastic Actor. In Proceedings of the 35th International Conference on Machine Learning, ICML 2018, 1856–1865.
  • He, K.; Fan, H.; Wu, Y.; Xie, S.; and Girshick, R. B. 2020. Momentum Contrast for Unsupervised Visual Representation Learning. In 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2020, 9726–9735. IEEE.
  • Henaff, O. J.; Srinivas, A.; Fauw, J. D.; Razavi, A.; Doersch, C.; Eslami, S. M. A.; and van den Oord, A. 2019. Data-Efficient Image Recognition with Contrastive Predictive Coding. CoRR abs/1905.09272.
  • Lee, K.; Seo, Y.; Lee, S.; Lee, H.; and Shin, J. 2020. Context-aware Dynamics Model for Generalization in Model-Based Reinforcement Learning. CoRR abs/2005.06800.
  • Lillicrap, T. P.; Hunt, J. J.; Pritzel, A.; Heess, N.; Erez, T.; Tassa, Y.; Silver, D.; and Wierstra, D. 2016. Continuous control with deep reinforcement learning. In 4th International Conference on Learning Representations, ICLR 2016.
  • Linsker, R. 1988. Self-Organization in a Perceptual Network. Computer 21(3): 105–117.
  • Liu, E. Z.; Raghunathan, A.; Liang, P.; and Finn, C. 2020. Explore then Execute: Adapting without Rewards via Factorized Meta-Reinforcement Learning. CoRR abs/2008.02790.
  • Liu, H.; Socher, R.; and Xiong, C. 2019. Taming MAML: Efficient unbiased meta-reinforcement learning. In ICML.
  • van der Maaten, L.; and Hinton, G. E. 2008. Visualizing Data using t-SNE. Journal of Machine Learning Research 9: 2579–2605.
  • Mnih, V.; Kavukcuoglu, K.; Silver, D.; Rusu, A. A.; Veness, J.; Bellemare, M. G.; Graves, A.; Riedmiller, M. A.; Fidjeland, A.; Ostrovski, G.; Petersen, S.; Beattie, C.; Sadik, A.; Antonoglou, I.; King, H.; Kumaran, D.; Wierstra, D.; Legg, S.; and Hassabis, D. 2015. Human-level control through deep reinforcement learning. Nature 518(7540): 529–533. doi:10.1038/nature14236.
  • Naik, D. K.; and Mammone, R. 1992. Meta-neural networks that learn by learning. In Proceedings of the 1992 IJCNN International Joint Conference on Neural Networks, volume 1, 437–442.
  • Pathak, D.; Agrawal, P.; Efros, A. A.; and Darrell, T. 2017. Curiosity-driven Exploration by Self-supervised Prediction. In Proceedings of the 34th International Conference on Machine Learning, ICML 2017, volume 70 of Proceedings of Machine Learning Research, 2778–2787. PMLR.
  • Poole, B.; Ozair, S.; van den Oord, A.; Alemi, A.; and Tucker, G. 2019. On Variational Bounds of Mutual Information. In Proceedings of the 36th International Conference on Machine Learning, ICML 2019, volume 97 of Proceedings of Machine Learning Research, 5171–5180. PMLR.
  • Rakelly, K.; Zhou, A.; Finn, C.; Levine, S.; and Quillen, D. 2019. Efficient Off-Policy Meta-Reinforcement Learning via Probabilistic Context Variables. In Proceedings of the 36th International Conference on Machine Learning, ICML 2019, 5331–5340.
  • Rothfuss, J.; Lee, D.; Clavera, I.; Asfour, T.; and Abbeel, P. 2019. ProMP: Proximal Meta-Policy Search. In 7th International Conference on Learning Representations, ICLR 2019.
  • Schmidhuber, J. 1987. Evolutionary principles in self-referential learning.
  • Schulman, J.; Levine, S.; Abbeel, P.; Jordan, M. I.; and Moritz, P. 2015. Trust Region Policy Optimization. In Proceedings of the 32nd International Conference on Machine Learning, ICML 2015, 1889–1897.
  • Sermanet, P.; Lynch, C.; Chebotar, Y.; Hsu, J.; Jang, E.; Schaal, S.; and Levine, S. 2018. Time-Contrastive Networks: Self-Supervised Learning from Video. In 2018 IEEE International Conference on Robotics and Automation, ICRA 2018, 1134–1141. IEEE.
  • Srinivas, A.; Laskin, M.; and Abbeel, P. 2020. CURL: Contrastive Unsupervised Representations for Reinforcement Learning. CoRR abs/2004.04136.
  • Thrun, S.; and Pratt, L. Y. 1998. Learning to Learn. Springer US.
  • Todorov, E.; Erez, T.; and Tassa, Y. 2012. MuJoCo: A physics engine for model-based control. In 2012 IEEE/RSJ International Conference on Intelligent Robots and Systems, IROS 2012, 5026–5033.
  • van den Oord, A.; Li, Y.; and Vinyals, O. 2018. Representation Learning with Contrastive Predictive Coding. CoRR abs/1807.03748.
  • Wang, J. X.; Kurth-Nelson, Z.; Tirumala, D.; Soyer, H.; Leibo, J. Z.; Munos, R.; Blundell, C.; Kumaran, D.; and Botvinick, M. 2016. Learning to reinforcement learn. CoRR abs/1611.05763.
  • Wu, Z.; Xiong, Y.; Yu, S.; and Lin, D. 2018. Unsupervised Feature Learning via Non-parametric Instance Discrimination. In 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2018, 3733–3742.
  • Zhang, J.; Wang, J.; Hu, H.; Chen, Y.; Fan, C.; and Zhang, C. 2020. Learn to Effectively Explore in Context-Based Meta-RL. CoRR abs/2006.08170.
  • Zhou, W.; Pinto, L.; and Gupta, A. 2019. Environment Probing Interaction Policies. In 7th International Conference on Learning Representations, ICLR 2019. OpenReview.net.
  • Zintgraf, L. M.; Shiarlis, K.; Igl, M.; Schulze, S.; Gal, Y.; Hofmann, K.; and Whiteson, S. 2020. VariBAD: A Very Good Method for Bayes-Adaptive Deep RL via Meta-Learning. In 8th International Conference on Learning Representations, ICLR 2020. OpenReview.net.
    We run all experiments with OpenAI Gym (Brockman et al. 2016) and the MuJoCo simulator (Todorov, Erez, and Tassa 2012). The benchmarks used in our experiments are visualized in Figure 7. We further modify the original tasks to be Meta-RL tasks, similar to (Rakelly et al. 2019; Lee et al. 2020; Fakoor et al. 2020):
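As a purely hypothetical illustration of this kind of task construction (the environment ID, info key, and reward shape below are assumptions, not the benchmark code), a locomotion environment can be turned into a task family by sampling a hidden target velocity that defines the reward:

    import numpy as np
    import gym

    class TargetVelTask(gym.Wrapper):
        """Hypothetical Meta-RL task: the reward depends on a hidden target
        velocity, so different targets define different tasks that share the
        same dynamics. Uses the classic 4-tuple gym step API."""
        def __init__(self, env, target_vel):
            super().__init__(env)
            self.target_vel = target_vel            # hidden task parameter

        def step(self, action):
            obs, _, done, info = self.env.step(action)
            # Reward matching the (unobserved) target forward velocity.
            vel = info.get("x_velocity", 0.0)
            return obs, -abs(vel - self.target_vel), done, info

    def sample_task(rng=np.random):
        """Draw one task from the (hypothetical) task distribution."""
        return TargetVelTask(gym.make("HalfCheetah-v3"), rng.uniform(0.0, 3.0))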