Provable Representation Learning for Imitation Learning via Bi-level Optimization

ICML, pp. 367-376, 2020.

Abstract:

A common strategy in modern learning systems is to learn a representation that is useful for many tasks, a.k.a. representation learning. We study this strategy in the imitation learning setting for Markov decision processes (MDPs) where multiple experts' trajectories are available. We formulate representation learning as a bi-level optimization problem […]
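
In schematic form, the bi-level objective described above can be written as follows. This is a hedged reconstruction from the abstract's wording, not a quotation of the paper's equations; the per-task imitation losses \ell_t and the task-specific parameter class \mathcal{W} are assumed notation:

    \hat{\phi} \;\in\; \arg\min_{\phi \in \Phi} \; \frac{1}{T} \sum_{t=1}^{T} \; \min_{w_t \in \mathcal{W}} \; \ell_t(w_t \circ \phi)

The outer minimization learns the representation \phi shared across the T experts' tasks; each inner minimization fits the task-specific parameters w_t under an imitation-learning loss.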

Introduction
  • Humans can often learn from experts quickly, from only a few demonstrations, and the authors would like artificial agents to do the same.
  • While several methods have been proposed [Duan et al., 2017, Finn et al., 2017b, James et al., 2018] to build agents that can adapt quickly to new tasks, none of them, to the authors' knowledge, gives provable guarantees showing the benefit of using past experience.
  • Moreover, these methods do not focus on learning a representation.
Highlights
  • Humans can often learn from experts quickly, from only a few demonstrations, and we would like our artificial agents to do the same.
  • The current paper studies how to apply representation learning to imitation learning.
  • We propose a framework to formulate this problem and analyze the statistical gains of representation learning for imitation learning.
  • We first instantiate our framework in the setting where the agent can observe experts' actions and tries to find a policy that matches the expert's policy, a.k.a. behavior cloning. This setting can be viewed as a straightforward extension of multi-task representation learning for supervised learning [Maurer et al., 2016]. We show in this setting that, with a sufficient number of experts, the agent can learn a representation that provably reduces the sample complexity of a new target imitation learning task (a minimal training sketch follows this list).
  • As in the previous section, the number of samples required for a new task after learning a representation is independent of the representation class Φ and depends only on the value function class.
  • The current paper proposes a bi-level optimization framework to formulate and analyze representation learning for imitation learning using multiple demonstrators.
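
The behavior-cloning instantiation can be pictured with a short sketch. This is a minimal illustration of multi-task behavior cloning with a shared representation, assuming a discrete action space and PyTorch; all module names, dimensions, and the optimizer choice are illustrative assumptions, not the paper's code:

    import torch
    import torch.nn as nn

    # Shared representation phi plus one linear head per expert/task
    # (dimensions are toy values chosen for illustration).
    STATE_DIM, REPR_DIM, NUM_ACTIONS, NUM_TASKS = 8, 16, 4, 10

    phi = nn.Sequential(nn.Linear(STATE_DIM, 64), nn.ReLU(),
                        nn.Linear(64, REPR_DIM))
    heads = nn.ModuleList(nn.Linear(REPR_DIM, NUM_ACTIONS)
                          for _ in range(NUM_TASKS))
    opt = torch.optim.Adam(list(phi.parameters()) + list(heads.parameters()))
    loss_fn = nn.CrossEntropyLoss()

    def bc_step(batches):
        """One joint update: sum the per-task behavior-cloning losses,
        backpropagating through the shared representation phi.

        batches: one (states, expert_actions) pair per task.
        """
        opt.zero_grad()
        loss = sum(loss_fn(heads[t](phi(s)), a)
                   for t, (s, a) in enumerate(batches))
        loss.backward()
        opt.step()
        return loss.item()

    # Toy usage with random "demonstrations", just to show the shapes.
    batches = [(torch.randn(32, STATE_DIM),
                torch.randint(0, NUM_ACTIONS, (32,)))
               for _ in range(NUM_TASKS)]
    print(bc_step(batches))

Jointly minimizing the summed losses is a practical stand-in for the bi-level objective above: the per-task heads play the role of the inner, task-specific parameters.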
Methods
  • The authors present their experimental results.
  • These experiments have two aims; the first is to verify the theory that representation learning can reduce the sample complexity of a new imitation learning task.
  • Since the goal of the experiments is to demonstrate the advantage of representation learning, the authors consider only the standard baseline in which, for each task, a policy π is learned from the class Π from scratch (see the sketch following this list).
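
A hedged sketch of this comparison (all names, dimensions, and the random "demonstrations" are illustrative assumptions, not the paper's experimental code): given a representation phi learned from the source tasks, the target task fits only a linear head on top of the frozen phi, while the baseline trains a full policy from scratch on the same data.

    import torch
    import torch.nn as nn

    STATE_DIM, REPR_DIM, NUM_ACTIONS = 8, 16, 4
    states = torch.randn(64, STATE_DIM)              # target-task states
    actions = torch.randint(0, NUM_ACTIONS, (64,))   # expert's actions

    def train(model, params, steps=200):
        """Behavior cloning on the target task, updating only `params`."""
        opt = torch.optim.Adam(params)
        loss = None
        for _ in range(steps):
            opt.zero_grad()
            loss = nn.functional.cross_entropy(model(states), actions)
            loss.backward()
            opt.step()
        return loss.item()

    # Transfer: freeze the learned representation, fit a linear head only.
    phi = nn.Sequential(nn.Linear(STATE_DIM, 64), nn.ReLU(),
                        nn.Linear(64, REPR_DIM))     # stands in for learned phi
    phi.requires_grad_(False)
    head = nn.Linear(REPR_DIM, NUM_ACTIONS)
    transfer_loss = train(nn.Sequential(phi, head), head.parameters())

    # Baseline: learn the whole policy from scratch.
    scratch_pi = nn.Sequential(nn.Linear(STATE_DIM, 64), nn.ReLU(),
                               nn.Linear(64, NUM_ACTIONS))
    baseline_loss = train(scratch_pi, scratch_pi.parameters())
    print(transfer_loss, baseline_loss)

Freezing phi means the target task only has to estimate the K-dimensional head, which is the source of the sample-complexity gain the experiments test.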
Conclusion
  • The above bound says that, as long as the authors have enough tasks to learn a representation from Φ and sufficient samples per task to learn a linear policy, the learned policy will have small average cost on a new task drawn from η.
  • If the complexity of the representation function class Φ is much greater than the number of actions (log(|Φ|) ≫ K in this case), multi-task representation learning can be much more sample efficient (a schematic version of this comparison follows this list).
  • [Figure: target-task performance versus the number of trajectories for the target task (10–40), with curves for representations learned from 1 and 2 experts.] The current paper proposes a bi-level optimization framework to formulate and analyze representation learning for imitation learning using multiple demonstrators.
  • The authors believe it is an interesting theoretical question to explain this phenomenon.
  • Extending this bi-level optimization framework to incorporate methods beyond imitation learning is an interesting future direction.
  • While the authors fix the learned representation for a new task, one could instead fine-tune the representation given samples from the new task; a theoretical analysis of this would be of interest.
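
For intuition only, here is a schematic form of such a transfer bound, in the style of standard multi-task generalization bounds for a finite representation class. This is an assumption-laden sketch with horizon factors and constants omitted, not the paper's theorem. With T source tasks of n demonstrations each and n' demonstrations for the new task:

    \mathrm{err}_{\text{new}} \;\lesssim\; \sqrt{\frac{\log|\Phi|}{nT}} \;+\; \sqrt{\frac{K}{n'}}
    \qquad\text{versus}\qquad
    \mathrm{err}_{\text{scratch}} \;\lesssim\; \sqrt{\frac{\log|\Phi| + K}{n'}}

So whenever \log|\Phi| \gg K, amortizing the representation cost over the nT source samples leaves only the cheap \sqrt{K/n'} linear-head term for the new task.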
Tables
  • Table 1: Number of hidden units for different experiments
Related work
  • Representation learning has shown great power in various domains; see Bengio et al. [2013] for a survey. Theoretically, Maurer et al. [2016] studied the benefit of representation learning for sample complexity reduction in the multi-task supervised learning setting. Recently, Arora et al. [2019] analyzed the benefit of representation learning via contrastive learning. While these papers all build representations for the agent / learner, researchers have also tried to build representations of the environment / physical world [Wu et al., 2017].

    Imitation learning can help with the sample efficiency of many problems [Ross and Bagnell, 2010, Sun et al., 2017, Daumé et al., 2009, Chang et al., 2015, Pan et al., 2018]. Most existing work considers the setting where the learner can observe the expert's actions. A general strategy is to use supervised learning to learn a policy that maps states to actions matching the expert's behavior. The most straightforward approach is behavior cloning [Pomerleau, 1991], which we also study in our paper. More advanced approaches have also been proposed [Ross et al., 2011, Ross and Bagnell, 2014, Sun et al., 2018]. These approaches, including behavior cloning, often enjoy sound theoretical guarantees in the single-task case. Our work extends the theoretical guarantees of behavior cloning to the multi-task representation learning setting.
Reference
  • Sanjeev Arora, Hrishikesh Khandeparkar, Mikhail Khodak, Orestis Plevrakis, and Nikunj Saunshi. A theoretical analysis of contrastive unsupervised representation learning. In Proceedings of the 36th International Conference on Machine Learning, 2019.
  • Peter L. Bartlett and Shahar Mendelson. Rademacher and Gaussian complexities: Risk bounds and structural results. Journal of Machine Learning Research, 2003.
  • Jonathan Baxter. A model of inductive bias learning. Journal of Artificial Intelligence Research, 2000.
  • Yoshua Bengio, Aaron Courville, and Pascal Vincent. Representation learning: A review and new perspectives. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2013.
  • Greg Brockman, Vicki Cheung, Ludwig Pettersson, Jonas Schneider, John Schulman, Jie Tang, and Wojciech Zaremba. OpenAI Gym, 2016.
  • Brian Bullins, Elad Hazan, Adam Kalai, and Roi Livni. Generalize across tasks: Efficient algorithms for linear representation learning. In Proceedings of the 30th International Conference on Algorithmic Learning Theory, 2019.
  • Kai-Wei Chang, Akshay Krishnamurthy, Alekh Agarwal, Hal Daumé III, and John Langford. Learning to search better than your teacher. In Proceedings of the 32nd International Conference on Machine Learning, 2015.
  • Hal Daumé III, John Langford, and Daniel Marcu. Search-based structured prediction. Machine Learning, 2009.
  • Giulia Denevi, Carlo Ciliberto, Dimitris Stamos, and Massimiliano Pontil. Incremental learning-to-learn with statistical guarantees. In Proceedings of the Conference on Uncertainty in Artificial Intelligence, 2018.
  • Giulia Denevi, Carlo Ciliberto, Riccardo Grazzi, and Massimiliano Pontil. Learning-to-learn stochastic gradient descent with biased regularization. In Proceedings of the 36th International Conference on Machine Learning, 2019.
  • Prafulla Dhariwal, Christopher Hesse, Oleg Klimov, Alex Nichol, Matthias Plappert, Alec Radford, John Schulman, Szymon Sidor, Yuhuai Wu, and Peter Zhokhov. OpenAI Baselines. https://github.com/openai/baselines, 2017.
  • Yan Duan, Marcin Andrychowicz, Bradly Stadie, Jonathan Ho, Jonas Schneider, Ilya Sutskever, Pieter Abbeel, and Wojciech Zaremba. One-shot imitation learning. In Advances in Neural Information Processing Systems 30, 2017.
  • Ashley D. Edwards, Himanshu Sahni, Yannick Schroecker, and Charles Lee Isbell. Imitating latent policies from observation. arXiv preprint arXiv:1805.07914, 2018.
  • Chelsea Finn, Pieter Abbeel, and Sergey Levine. Model-agnostic meta-learning for fast adaptation of deep networks. In Proceedings of the 34th International Conference on Machine Learning, 2017a.
  • Chelsea Finn, Tianhe Yu, Tianhao Zhang, Pieter Abbeel, and Sergey Levine. One-shot visual imitation learning via meta-learning. 2017b.
  • Chelsea Finn, Aravind Rajeswaran, Sham Kakade, and Sergey Levine. Online meta-learning. In Proceedings of the 36th International Conference on Machine Learning, 2019.
  • Jonathan Ho and Stefano Ermon. Generative adversarial imitation learning. In Advances in Neural Information Processing Systems 29, 2016.
  • Stephen James, Michael Bloesch, and Andrew Davison. Task-embedded control networks for few-shot imitation learning. 2018.
  • Sham Machandranath Kakade. On the sample complexity of reinforcement learning. PhD thesis, University of London, London, England, 2003.
  • Mikhail Khodak, Maria-Florina Balcan, and Ameet Talwalkar. Adaptive gradient-based meta-learning methods. arXiv preprint arXiv:1906.02717, 2019.
  • Diederik P. Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
  • Hoang Le, Cameron Voloshin, and Yisong Yue. Batch policy learning under constraints. In Proceedings of the 36th International Conference on Machine Learning, pages 3703–3712, 2019.
  • Andreas Maurer. Transfer bounds for linear feature learning. Machine Learning, 2009.
  • Andreas Maurer, Massimiliano Pontil, and Bernardino Romera-Paredes. The benefit of multitask representation learning. Journal of Machine Learning Research, 17(1):2853–2884, 2016.
  • Rémi Munos. Error bounds for approximate value iteration. In Proceedings of the 20th National Conference on Artificial Intelligence, AAAI'05. AAAI Press, 2005.
  • Rémi Munos and Csaba Szepesvári. Finite-time bounds for fitted value iteration. Journal of Machine Learning Research, 2008.
  • Yunpeng Pan, Ching-An Cheng, Kamil Saigol, Keuntaek Lee, Xinyan Yan, Evangelos Theodorou, and Byron Boots. Agile autonomous driving using end-to-end deep imitation learning. In Proceedings of Robotics: Science and Systems, 2018.
  • Dean A. Pomerleau. Efficient training of artificial neural networks for autonomous navigation. Neural Computation, 3, 1991.
  • Aniruddh Raghu, Maithra Raghu, Samy Bengio, and Oriol Vinyals. Rapid learning or feature reuse? Towards understanding the effectiveness of MAML. arXiv preprint arXiv:1909.09157, 2019.
  • Aravind Rajeswaran, Chelsea Finn, Sham Kakade, and Sergey Levine. Meta-learning with implicit gradients. arXiv preprint arXiv:1909.04630, 2019.
  • Stéphane Ross and Drew Bagnell. Efficient reductions for imitation learning. In Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics, pages 661–668, 2010.
  • Stéphane Ross and J. Andrew Bagnell. Reinforcement and imitation learning via interactive no-regret learning. arXiv preprint arXiv:1406.5979, 2014.
  • Stéphane Ross, Geoffrey Gordon, and Drew Bagnell. A reduction of imitation learning and structured prediction to no-regret online learning. In Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, 2011.
  • John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347, 2017.
  • Ju Sun, Qing Qu, and John Wright. Complete dictionary recovery over the sphere I: Overview and the geometric picture. IEEE Transactions on Information Theory, 63(2):853–884, 2017.
  • Wen Sun, J. Andrew Bagnell, and Byron Boots. Truncated horizon policy search: Combining reinforcement learning and imitation learning. arXiv preprint arXiv:1805.11240, 2018.
  • Wen Sun, Anirudh Vemula, Byron Boots, and J. Andrew Bagnell. Provably efficient imitation learning from observation alone. arXiv preprint arXiv:1905.10948, 2019.
  • Umar Syed and Robert E. Schapire. A reduction from apprenticeship learning to classification. In Advances in Neural Information Processing Systems 23, pages 2253–2261, 2010.
  • Emanuel Todorov, Tom Erez, and Yuval Tassa. MuJoCo: A physics engine for model-based control. In Proceedings of the IEEE/RSJ International Conference on Intelligent Robots and Systems, pages 5026–5033. IEEE, 2012. URL http://dblp.uni-trier.de/db/conf/iros/iros2012.html#TodorovET12.
  • Faraz Torabi, Garrett Warnell, and Peter Stone. Behavioral cloning from observation. In IJCAI, 2018.
  • Jiajun Wu, Erika Lu, Pushmeet Kohli, Bill Freeman, and Joshua B. Tenenbaum. Learning to see physics via visual de-animation. In Advances in Neural Information Processing Systems 30, 2017.