Meta-Q-Learning

    International Conference on Learning Representations, 2020.

    Keywords:
    meta reinforcement learning, propensity estimation, off-policy

    Abstract:

    This paper introduces Meta-Q-Learning (MQL), a new off-policy algorithm for meta-Reinforcement Learning (meta-RL). MQL builds upon three simple ideas. First, we show that Q-learning is competitive with state-of-the-art meta-RL algorithms if given access to a context variable that is a representation of the past trajectory. Second, a multi…

    Introduction
    • Reinforcement Learning (RL) algorithms have demonstrated good performance on simulated data.
    • There are, however, two main challenges in translating this performance to real robots: (i) robots are complex and fragile, which precludes extensive data collection, and (ii) a real robot may face an environment that is different from the simulated environment it was trained in.
    • This has fueled research into Meta-Reinforcement Learning (meta-RL).
    • Given a deterministic policy $u_\theta$, the action-value function for $\gamma$-discounted future rewards $r_t^k := r^k(x_t, u_\theta(x_t))$ over an infinite time horizon is $q^k(x, u) = \mathbb{E}\big[\sum_{t \ge 0} \gamma^t\, r_t^k \mid x_0 = x,\ u_0 = u\big]$, where the expectation is over trajectories of task $k$.
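    The context variable mentioned above is a representation of the past trajectory that is fed to the Q-function alongside the current state and action. Below is a minimal PyTorch-style sketch of such a context-conditioned critic; it is an illustration under assumptions (a GRU encoder over past transitions, a two-layer 256-unit critic), not the authors' implementation.

        # Sketch: a GRU summarizes past (state, action, reward) tuples into a
        # context z, and the critic estimates q(x, u, z). Sizes are assumptions.
        import torch
        import torch.nn as nn

        class ContextEncoder(nn.Module):
            def __init__(self, state_dim, action_dim, context_dim=20):
                super().__init__()
                self.gru = nn.GRU(state_dim + action_dim + 1, context_dim, batch_first=True)

            def forward(self, past_transitions):
                # past_transitions: (batch, T, state_dim + action_dim + 1)
                _, h = self.gru(past_transitions)
                return h.squeeze(0)  # (batch, context_dim)

        class ContextCritic(nn.Module):
            def __init__(self, state_dim, action_dim, context_dim=20, hidden=256):
                super().__init__()
                self.net = nn.Sequential(
                    nn.Linear(state_dim + action_dim + context_dim, hidden), nn.ReLU(),
                    nn.Linear(hidden, hidden), nn.ReLU(),
                    nn.Linear(hidden, 1),
                )

            def forward(self, x, u, z):
                # Estimate of the discounted action-value for the task summarized by z.
                return self.net(torch.cat([x, u, z], dim=-1))

    In the off-policy setting of the paper, such a critic would be trained with a TD3-style objective; whether the context encoder is learned jointly or kept fixed is not specified in this excerpt.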
    Highlights
    • Reinforcement Learning (RL) algorithms have demonstrated good performance on simulated data
    • Fig. 2 shows that vanilla off-policy learning with context, without any adaptation, is competitive with state-of-the-art Meta-Reinforcement Learning algorithms
    • Policies that have access to the context can generalize to the validation tasks and achieve performance that is comparable to more sophisticated Meta-Reinforcement Learning algorithms
    • Q-learning with context is sufficient to be competitive on current Meta-Reinforcement Learning benchmarks
    • The fact that even vanilla Q-learning with a context variable, without meta-training and without any adaptation, is competitive with state-of-the-art algorithms indicates that (i) training and validation tasks in the current Meta-Reinforcement Learning benchmarks are quite similar to each other, and (ii) current benchmarks may be insufficient to evaluate Meta-Reinforcement Learning algorithms
    • Both of these observations are a call to action and point to the need to invest resources in creating better benchmark problems for Meta-Reinforcement Learning that drive the development of new algorithms
    Methods
    • The authors first discuss the setup and provide details of the benchmark in Sec. 4.1
    • This is followed by empirical results and ablation experiments in Sec. 4.2.
    • Tasks and algorithms: The authors use the MuJoCo (Todorov et al., 2012) simulator with OpenAI Gym (Brockman et al., 2016) on continuous-control meta-RL benchmark tasks
    • These tasks have different rewards or randomized system parameters (e.g., Walker-2D-Params) and have been used in previous papers such as Finn et al. (2017), Rothfuss et al. (2018), and Rakelly et al. (2019); a hypothetical sketch of such a task distribution follows this list.
    • The authors obtained the training curves and hyper-parameters for all three algorithms from the code published by Rakelly et al. (2019)
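    The tasks above share the simulator but differ in their reward (for example, a per-task goal location) or in randomized system parameters. The following is a hypothetical sketch of how a goal-based task distribution could be set up with the classic Gym API; the GoalTask wrapper, the Ant-v2 base environment, and the sampling radius are illustrative assumptions, not the benchmark's actual code (which follows Rakelly et al., 2019).

        # Hypothetical task distribution: each task is the same simulator with a
        # different goal; the reward is the negative distance to that goal.
        # Assumes the classic Gym API (4-tuple step) and mujoco-py environments.
        import numpy as np
        import gym

        class GoalTask(gym.Wrapper):
            def __init__(self, env, goal):
                super().__init__(env)
                self.goal = np.asarray(goal, dtype=np.float64)

            def step(self, action):
                obs, _, done, info = self.env.step(action)
                xy = self.env.unwrapped.sim.data.qpos[:2]   # torso x, y position
                reward = -float(np.linalg.norm(xy - self.goal))
                return obs, reward, done, info

        def sample_goals(n_tasks, radius=2.0, seed=0):
            # Sample goal positions uniformly on a circle of the given radius.
            rng = np.random.RandomState(seed)
            angles = rng.uniform(0.0, 2.0 * np.pi, size=n_tasks)
            return [radius * np.array([np.cos(a), np.sin(a)]) for a in angles]

        train_tasks = [GoalTask(gym.make("Ant-v2"), g) for g in sample_goals(10)]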
    Results
    • Fig. 2 shows that vanilla off-policy learning with context, without any adaptation, is competitive with state-of-the-art meta-RL algorithms.
    • Policies that have access to the context can generalize to the validation tasks and achieve performance that is comparable to more sophisticated meta-RL algorithms.
    • Compare the training curve for TD3-context on the Ant-Goal-2D environment in Fig. 2 with that of the same environment in Fig. 3: the former shows a prominent dip in performance as meta-training progresses, while the dip is absent in Fig. 3; its absence can be attributed to the adaptation phase of MQL (sketched below)
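    The adaptation phase referred to above is where MQL goes beyond plain TD3-context: per the paper's keywords, it uses propensity estimation to reuse off-policy data from the meta-training replay buffer when adapting to a new task. The sketch below illustrates that general idea with a logistic-regression propensity model and a normalized effective-sample-size diagnostic; these specific choices are assumptions for illustration, not necessarily the paper's exact estimator.

        # Sketch of propensity-based reweighting for off-policy adaptation:
        # distinguish new-task transitions (label 1) from meta-training replay
        # buffer transitions (label 0), then weight buffer data by the estimated
        # likelihood ratio. LogisticRegression is an illustrative choice.
        import numpy as np
        from sklearn.linear_model import LogisticRegression

        def propensity_weights(new_task_feats, buffer_feats):
            X = np.vstack([new_task_feats, buffer_feats])
            y = np.concatenate([np.ones(len(new_task_feats)), np.zeros(len(buffer_feats))])
            clf = LogisticRegression(max_iter=1000).fit(X, y)

            beta = clf.predict_proba(buffer_feats)[:, 1]   # P(new task | transition)
            w = beta / (1.0 - beta + 1e-8)                 # likelihood-ratio estimate
            w = w / w.sum()                                # normalize the weights

            # Normalized effective sample size in (0, 1]; small values indicate
            # the buffer data is of limited use for the new task.
            ess = 1.0 / (len(w) * np.sum(w ** 2))
            return w, ess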
    Conclusion
    • The algorithm proposed in this paper, namely MQL, builds upon three simple ideas. First, Q-learning with context is sufficient to be competitive on current meta-RL benchmarks.
    • The fact that even vanilla Q-learning with a context variable, without meta-training and without any adaptation, is competitive with state-of-the-art algorithms indicates that (i) training and validation tasks in the current meta-RL benchmarks are quite similar to each other, and (ii) current benchmarks may be insufficient to evaluate meta-RL algorithms
    • Both of these observations are a call to action and point to the need to invest resources in creating better benchmark problems for meta-RL that drive the development of new algorithms
    Tables
    • Table 1: Hyper-parameters for MQL and TD3 for continuous-control meta-RL benchmark tasks. We use a network with two fully-connected layers for all environments. The batch-size in Adam is fixed to 256 for all environments. The abbreviation HC stands for Half-Cheetah. These hyper-parameters were tuned by grid search
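    For reference, here is a small configuration sketch consistent with the caption above; only the two fully-connected layers and the Adam batch size of 256 come from the caption, while the remaining values are placeholders since the table's contents are not reproduced on this page.

        # Illustrative hyper-parameter configuration in the spirit of Table 1.
        import torch

        config = {
            "hidden_layers": [256, 256],    # two fully-connected layers (per the caption); width 256 is a placeholder
            "batch_size": 256,              # Adam batch size fixed to 256 (per the caption)
            "optimizer": torch.optim.Adam,  # Adam optimizer (Kingma & Ba, 2014)
            "learning_rate": 3e-4,          # placeholder; tuned by grid search in the paper
            "discount": 0.99,               # placeholder
        }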

    References
    • Deepak Agarwal, Lihong Li, and Alexander Smola. Linear-time estimators for propensity scores. In Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, pp. 93–100, 2011.
    • Heejung Bang and James M Robins. Doubly robust estimation in missing data and causal inference models. Biometrics, 61(4):962–973, 2005.
    • Jonathan Baxter. Learning internal representations. Flinders University of S. Aust., 1995.
    • Jonathan Baxter. A model of inductive bias learning. Journal of Artificial Intelligence Research, 12:149–198, 2000.
    • Samy Bengio, Yoshua Bengio, Jocelyn Cloutier, and Jan Gecsei. On the optimization of a synaptic learning rule. In Preprints Conf. Optimality in Artificial and Biological Neural Networks, pp. 6–8. Univ. of Texas, 1992.
    • Samy Bengio, Yoshua Bengio, Jocelyn Cloutier, and Jan Gecsei. On the optimization of a synaptic learning rule, 1997.
    • Greg Brockman, Vicki Cheung, Ludwig Pettersson, Jonas Schneider, John Schulman, Jie Tang, and Wojciech Zaremba. OpenAI Gym. arXiv:1606.01540, 2016.
    • Wei-Yu Chen, Yen-Cheng Liu, Zsolt Kira, Yu-Chiang Frank Wang, and Jia-Bin Huang. A closer look at few-shot classification. 2018.
    • Kyunghyun Cho, Bart Van Merrienboer, Caglar Gulcehre, Dzmitry Bahdanau, Fethi Bougares, Holger Schwenk, and Yoshua Bengio. Learning phrase representations using RNN encoder-decoder for statistical machine translation. arXiv:1406.1078, 2014.
    • Guneet S Dhillon, Pratik Chaudhari, Avinash Ravichandran, and Stefano Soatto. A baseline for few-shot image classification. arXiv:1909.02729, 2019.
    • Yan Duan, John Schulman, Xi Chen, Peter L Bartlett, Ilya Sutskever, and Pieter Abbeel. RL2: Fast reinforcement learning via slow reinforcement learning. arXiv:1611.02779, 2016.
    • Miroslav Dudík, John Langford, and Lihong Li. Doubly robust policy evaluation and learning. arXiv:1103.4601, 2011.
    • Víctor Elvira, Luca Martino, and Christian P Robert. Rethinking the effective sample size. arXiv:1809.04129, 2018.
    • Rasool Fakoor, Pratik Chaudhari, and Alexander J Smola. P3o: Policy-on policy-off policy optimization. arXiv:1905.01756, 2019.
    • Chelsea Finn, Pieter Abbeel, and Sergey Levine. Model-agnostic meta-learning for fast adaptation of deep networks. In Proceedings of the 34th International Conference on Machine Learning-Volume 70, pp. 1126– 1135. JMLR. org, 2017.
    • Scott Fujimoto, David Meger, and Doina Precup. Off-policy deep reinforcement learning without exploration. arXiv:1812.02900, 2018a.
    • Scott Fujimoto, Herke van Hoof, and Dave Meger. Addressing function approximation error in actor-critic methods. arXiv:1802.09477, 2018b.
    • Spyros Gidaris and Nikos Komodakis. Dynamic few-shot visual learning without forgetting. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 4367–4375, 2018.
    • Matthew Hausknecht and Peter Stone. Deep recurrent q-learning for partially observable mdps. In 2015 AAAI Fall Symposium Series, 2015.
    • Nicolas Heess, Jonathan J Hunt, Timothy P Lillicrap, and David Silver. Memory-based control with recurrent neural networks. arXiv:1512.04455, 2015.
    • Peter Henderson, Riashat Islam, Philip Bachman, Joelle Pineau, Doina Precup, and David Meger. Deep reinforcement learning that matters. In Thirty-Second AAAI Conference on Artificial Intelligence, 2018.
    • Rein Houthooft, Yuhua Chen, Phillip Isola, Bradly Stadie, Filip Wolski, OpenAI Jonathan Ho, and Pieter Abbeel. Evolved policy gradients. In Advances in Neural Information Processing Systems, pp. 5400–5409, 2018.
    • Andrew Ilyas, Logan Engstrom, Shibani Santurkar, Dimitris Tsipras, Firdaus Janoos, Larry Rudolph, and Aleksander Madry. Are deep policy gradient algorithms truly policy gradient algorithms? arXiv:1811.02553, 2018.
    • Nan Jiang and Lihong Li. Doubly robust off-policy value evaluation for reinforcement learning. arXiv:1511.03722, 2015.
    • Joseph DY Kang, Joseph L Schafer, et al. Demystifying double robustness: A comparison of alternative strategies for estimating a population mean from incomplete data. Statistical science, 22(4):523–539, 2007.
    • Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv:1412.6980, 2014.
    • Augustine Kong. A note on importance sampling using standardized weights. Technical report, Dept. of Statistics, University of Chicago, 1992.
    • Jürgen Schmidhuber. Evolutionary principles in self-referential learning (On learning how to learn: The meta-meta-... hook). Diploma thesis, Institut f. Informatik, Tech. Univ. Munich, 1987.
    • Jürgen Schmidhuber, Jieyu Zhao, and Marco Wiering. Shifting inductive bias with success-story algorithm, adaptive Levin search, and incremental self-improvement. Machine Learning, 28(1):105–130, Jul 1997. ISSN 1573-0565.
    • John Schulman, Sergey Levine, Pieter Abbeel, Michael I Jordan, and Philipp Moritz. Trust region policy optimization. In International Conference on Machine Learning, volume 37, pp. 1889–1897, 2015.
    • David Silver, Guy Lever, Nicolas Heess, Thomas Degris, Daan Wierstra, and Martin Riedmiller. Deterministic policy gradient algorithms. In International Conference on Machine Learning, 2014.
    • Adrian Smith. Sequential Monte Carlo methods in practice. Springer Science & Business Media, 2013.
    • Jake Snell, Kevin Swersky, and Richard Zemel. Prototypical networks for few-shot learning. In Advances in Neural Information Processing Systems, pp. 4077–4087, 2017.
    • Sebastian Thrun. Is learning the n-th thing any easier than learning the first? In Advances in Neural Information Processing Systems, pp. 640–646, 1996.
    • Sebastian Thrun and Lorien Pratt. Learning to learn. Springer Science & Business Media, 2012.
    • Emanuel Todorov, Tom Erez, and Yuval Tassa. MuJoCo: A physics engine for model-based control. In 2012 IEEE/RSJ International Conference on Intelligent Robots and Systems, pp. 5026–5033. IEEE, 2012.
    • Paul E Utgoff. Shift of bias for inductive concept learning. Machine Learning: An Artificial Intelligence Approach, 2:107–148, 1986.
    • Jane X. Wang, Zeb Kurth-Nelson, Dhruva Tirumala, Hubert Soyer, Joel Z. Leibo, Remi Munos, Charles Blundell, Dharshan Kumaran, and Matthew Botvinick. Learning to reinforcement learn. CoRR, abs/1611.05763, 2016. URL http://arxiv.org/abs/1611.05763.