On Reward-Free Reinforcement Learning with Linear Function Approximation

NeurIPS 2020.


Abstract:

Reward-free reinforcement learning (RL) is a framework which is suitable for both the batch RL setting and the setting where there are many reward functions of interest. During the exploration phase, an agent collects samples without using a pre-specified reward function. After the exploration phase, a reward function is given, and the agent uses the samples collected during the exploration phase to compute a near-optimal policy for the given reward function.
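
As a minimal illustration of this two-phase protocol (not code from the paper), the sketch below separates an exploration routine that never sees a reward from a planning routine that receives one later; the class and method names (RewardFreeAgent, explore, plan) and the env interface are illustrative assumptions.

    from typing import Callable, List, Tuple

    State, Action = int, int
    Step = Tuple[State, Action, State]          # (s_h, a_h, s_{h+1})

    class RewardFreeAgent:
        """Illustrative two-phase interface for reward-free RL."""

        def __init__(self, horizon: int):
            self.H = horizon
            self.dataset: List[List[Step]] = []  # trajectories collected reward-free

        def explore(self, env, num_episodes: int) -> None:
            """Exploration phase: collect trajectories without any reward signal."""
            for _ in range(num_episodes):
                traj, s = [], env.reset()
                for h in range(self.H):
                    a = self.exploration_action(s, h)   # reward-independent choice
                    s_next = env.step(a)                # assumed env interface
                    traj.append((s, a, s_next))
                    s = s_next
                self.dataset.append(traj)

        def plan(self, reward: Callable[[int, State, Action], float]):
            """Planning phase: given a reward function r_h(s, a), compute a
            near-optimal policy using only the stored exploration data."""
            raise NotImplementedError

        def exploration_action(self, s: State, h: int) -> Action:
            raise NotImplementedError
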

Introduction
  • In reinforcement learning (RL), an agent repeatedly interacts with an unknown environment to maximize the cumulative reward.
  • The authors' algorithm, formally presented in Section 3, samples O(d^3 H^6 / ε^2) trajectories during the exploration phase and, with high probability, outputs ε-optimal policies during the planning phase for an arbitrary number of reward functions satisfying Assumption 2.1 (a sketch of the exploration idea follows this list).
  • The authors' second contribution is a hardness result for reward-free RL under the linear Q∗ assumption, which only requires the optimal value function to be a linear function of the given feature extractor and is therefore weaker than the linear MDP assumption.
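
As a rough, hedged sketch of how such an exploration phase can be driven without rewards under the linear MDP assumption, the snippet below runs one backward pass of least-squares value iteration in which a scaled uncertainty bonus plays the role of the reward (consistent with the relation r_h^k(·, ·) = u_h^k(·, ·)/H quoted in the proof fragments at the end of this page). All names (feat, data, actions, bonus, lsvi_explore_round) are illustrative placeholders, and the snippet is not the authors' exact algorithm.

    import numpy as np

    def bonus(phi, Lam_inv, beta):
        """Elliptical bonus u(s, a) = beta * sqrt(phi(s, a)^T Lambda^{-1} phi(s, a))."""
        return beta * float(np.sqrt(phi @ Lam_inv @ phi))

    def lsvi_explore_round(feat, data, actions, H, d, beta, lam=1.0):
        """One backward pass of least-squares value iteration in which the scaled
        bonus u_h / H plays the role of the reward, so no external reward is needed.

        feat:    feat(s, a) -> d-dimensional numpy feature vector phi(s, a)
        data:    data[h] is a list of observed transitions (s, a, s_next) at step h
        actions: finite action set
        Returns weights w[h] and inverse covariances Lam_inv[h]; the greedy
        exploration policy is pi_h(s) = argmax over a of the optimistic Q below.
        """
        w = [np.zeros(d) for _ in range(H + 1)]
        Lam_inv = [np.eye(d) / lam for _ in range(H + 1)]

        def q(h, s, a):
            # Optimistic Q estimate, truncated at H.
            phi = feat(s, a)
            return min(float(phi @ w[h]) + bonus(phi, Lam_inv[h], beta), H)

        for h in reversed(range(H)):
            Lam = lam * np.eye(d)
            for s, a, _ in data[h]:
                phi = feat(s, a)
                Lam += np.outer(phi, phi)
            Lam_inv[h] = np.linalg.inv(Lam)

            target = np.zeros(d)
            for s, a, s_next in data[h]:
                phi = feat(s, a)
                r_hat = bonus(phi, Lam_inv[h], beta) / H        # bonus as reward
                v_next = max(q(h + 1, s_next, b) for b in actions) if h + 1 < H else 0.0
                target += phi * (r_hat + v_next)
            w[h] = Lam_inv[h] @ target
        return w, Lam_inv
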
Highlights
  • In reinforcement learning (RL), an agent repeatedly interacts with an unknown environment to maximize the cumulative reward
  • In this work we study the reward-free RL setting, which was formalized in the recent work by Jin et al. [2020]
  • In the planning phase, a specific reward function is given to the agent, and the goal is to use samples collected during the exploration phase to output a near-optimal policy for the given reward function (a planning-phase sketch follows this list)
  • Our second contribution is a hardness result for reward-free RL under the linear Q∗ assumption, which only requires the optimal value function to be a linear function of the given feature extractor and is therefore weaker than the linear Markov decision process (MDP) assumption
  • We show that there exists a class of MDPs satisfying Assumption 2.2 such that any reward-free RL algorithm requires an exponential number of samples during the exploration phase in order to find a near-optimal policy during the planning phase
  • In the hard instances underlying this lower bound, a 0.1-optimal policy is found during the planning phase with probability at most 0.6 < 0.9. Overall, this paper provides both positive and negative results for reward-free RL with linear function approximation
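
To complement the exploration sketch above, the snippet below shows, under the same illustrative assumptions (and again not the authors' exact implementation), how the planning phase might reuse the stored transitions once a reward function is revealed: run the same least-squares value iteration, but regress against the given reward instead of the bonus.

    import numpy as np

    def lsvi_plan(feat, data, actions, reward, H, d, beta, lam=1.0):
        """Planning phase (sketch): least-squares value iteration on the exploration
        data against a revealed reward function reward(h, s, a) in [0, 1].
        Returns weights w[h]; the output policy is pi_h(s) = argmax_a Q_h(s, a)."""
        w = [np.zeros(d) for _ in range(H + 1)]
        Lam_inv = [np.eye(d) / lam for _ in range(H + 1)]

        def q(h, s, a):
            phi = feat(s, a)
            opt = float(phi @ w[h]) + beta * float(np.sqrt(phi @ Lam_inv[h] @ phi))
            return min(opt, H)                     # optimistic, truncated estimate

        for h in reversed(range(H)):
            Lam = lam * np.eye(d)
            for s, a, _ in data[h]:
                phi = feat(s, a)
                Lam += np.outer(phi, phi)
            Lam_inv[h] = np.linalg.inv(Lam)

            target = np.zeros(d)
            for s, a, s_next in data[h]:
                phi = feat(s, a)
                v_next = max(q(h + 1, s_next, b) for b in actions) if h + 1 < H else 0.0
                target += phi * (reward(h, s, a) + v_next)
            w[h] = Lam_inv[h] @ target
        return w
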
Results
  • The authors' lower bound, formally presented in Section 4, shows that under the linear Q∗ assumption, any algorithm requires an exponential number of samples during the exploration phase in order for the agent to output a near-optimal policy during the planning phase with high probability.
  • This hardness result demonstrates that, under the same assumption, any algorithm requires an exponential number of samples in the reward-free setting.
  • The authors show that for deterministic systems, under the linear Q∗ assumption, there exists a polynomial sample complexity upper bound in the reward-free setting when the agent has sampling access to a generative model.
  • For a specific set of reward functions r = {r_h}_{h=1}^H, given a policy π, a level h ∈ [H] and a state-action pair (s, a) ∈ S × A, the Q-function Q_h^π(s, a) is defined as the expected sum of the rewards at levels h, ..., H collected by taking action a in state s at level h and following π afterwards (the precise definitions are displayed after this list).
  • To measure the performance of an algorithm, the authors define the sample complexity to be the number of episodes K required in the exploration phase to output an ε-optimal policy in the planning phase.
  • After collecting O(d^3 H^6 log(d H δ^{-1} ε^{-1}) / ε^2) trajectories during the exploration phase, with probability 1 − δ, the algorithm outputs an ε-optimal policy for an arbitrary number of reward functions satisfying Assumption 2.1 during the planning phase.
  • The authors show that there exists a class of MDPs satisfying Assumption 2.2 such that any reward-free RL algorithm requires an exponential number of samples during the exploration phase in order to find a near-optimal policy during the planning phase.
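
For concreteness, the standard episodic definitions referenced above can be written out as follows (the usual notation, which may differ slightly from the paper's):

    % Standard episodic definitions; requires amsmath.
    \begin{align*}
      Q_h^{\pi}(s,a) &= r_h(s,a) + \mathbb{E}\Big[\sum_{h'=h+1}^{H}
          r_{h'}\big(s_{h'}, \pi_{h'}(s_{h'})\big) \,\Big|\, s_h = s,\ a_h = a\Big],\\
      V_h^{\pi}(s) &= Q_h^{\pi}\big(s, \pi_h(s)\big), \qquad
      V_h^{*}(s) \;=\; \sup_{\pi} V_h^{\pi}(s),\\
      \pi \text{ is } \varepsilon\text{-optimal} \ &\Longleftrightarrow\
      V_1^{*}(s_1) - V_1^{\pi}(s_1) \;\le\; \varepsilon .
    \end{align*}
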
Conclusion
  • There exists a class of deterministic systems that satisfy Assumption 2.2 with d = poly(H), such that any reward-free algorithm requires at least Ω(2^H) samples during the exploration phase in order to find a 0.1-optimal policy with probability at least 0.9 during the planning phase for a given set of reward functions r = {r_h}_{h=1}^H.
  • Since there are 2^{H−2} state-action pairs (s, a) ∈ S_{H−2} × A and only one of them satisfies P_{H−2}(s, a) = s^+_{H−1}, while the algorithm samples at most 2^H/100 trajectories during the exploration phase, the event E holds with probability at least 0.9 (a short counting argument is sketched after this list).
  • An interesting future direction is to generalize these results to more general function classes using techniques such as those in [Wen and Van Roy, 2013, Ayoub et al., 2020, Wang et al., 2020].
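
A back-of-the-envelope version of this counting step, under the assumption that each exploration trajectory can query at most one level-(H−2) state-action pair and that the special pair is chosen uniformly at random, is:

    % Union bound over K <= 2^H/100 exploration trajectories and the single
    % special pair among the 2^{H-2} candidates:
    \[
      \Pr[\text{special pair ever visited}]
        \;\le\; \frac{K}{2^{H-2}}
        \;\le\; \frac{2^{H}/100}{2^{H-2}}
        \;=\; 0.04 \;\le\; 0.1,
      \qquad\text{hence}\quad \Pr[E] \;\ge\; 0.9 .
    \]
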
Funding
  • RW and RS are supported in part by NSF IIS1763562, AFRL CogDeCON FA875018C0014, and DARPA SAGAMORE HR00111990016
  • SSD is supported by NSF grant DMS-1638352 and the Infosys Membership
Reference
  • Yasin Abbasi-Yadkori, David Pal, and Csaba Szepesvari. Online-to-confidence-set conversions and application to sparse stochastic bandits. In Artificial Intelligence and Statistics, pages 1–9, 2012.
  • Joshua Achiam, David Held, Aviv Tamar, and Pieter Abbeel. Constrained policy optimization. In Proceedings of the 34th International Conference on Machine Learning-Volume 70, pages 22–31. JMLR. org, 2017.
  • Alekh Agarwal, Sham M Kakade, Jason D Lee, and Gaurav Mahajan. Optimality and approximation with policy gradient methods in markov decision processes. arXiv preprint arXiv:1908.00261, 2019.
  • Eitan Altman. Constrained Markov decision processes, volume 7. CRC Press, 1999.
  • András Antos, Csaba Szepesvári, and Rémi Munos. Learning near-optimal policies with Bellman-residual minimization based fitted policy iteration and a single sample path. Machine Learning, 71(1):89–129, 2008.
  • Alex Ayoub, Zeyu Jia, Csaba Szepesvari, Mengdi Wang, and Lin F. Yang. Model-based reinforcement learning with value-targeted regression. arXiv preprint arXiv:2006.01107, 2020.
  • Marc Bellemare, Sriram Srinivasan, Georg Ostrovski, Tom Schaul, David Saxton, and Remi Munos. Unifying count-based exploration and intrinsic motivation. In Advances in neural information processing systems, pages 1471–1479, 2016.
  • Dimitri P Bertsekas and John N Tsitsiklis. Neuro-dynamic programming, volume 5. Athena Scientific Belmont, MA, 1996.
  • Ronen I Brafman and Moshe Tennenholtz. R-max - a general polynomial time algorithm for near-optimal reinforcement learning. Journal of Machine Learning Research, 3(Oct):213–231, 2002.
  • Yuri Burda, Harrison Edwards, Amos Storkey, and Oleg Klimov. Exploration by random network distillation. arXiv preprint arXiv:1810.12894, 2018.
  • Qi Cai, Zhuoran Yang, Chi Jin, and Zhaoran Wang. Provably efficient exploration in policy optimization. In International Conference on Machine Learning, 2020.
  • Jinglin Chen and Nan Jiang. Information-theoretic considerations in batch reinforcement learning. arXiv preprint arXiv:1905.00360, 2019.
  • John D Co-Reyes, YuXuan Liu, Abhishek Gupta, Benjamin Eysenbach, Pieter Abbeel, and Sergey Levine. Self-consistent trajectory autoencoder: Hierarchical reinforcement learning with trajectory embeddings. arXiv preprint arXiv:1806.02813, 2018.
  • Cédric Colas, Pierre Fournier, Olivier Sigaud, and Pierre-Yves Oudeyer. Curious: Intrinsically motivated multi-task multi-goal reinforcement learning. 2018.
  • Simon S Du, Akshay Krishnamurthy, Nan Jiang, Alekh Agarwal, Miroslav Dudík, and John Langford. Provably efficient rl with rich observations via latent state decoding. arXiv preprint arXiv:1901.09018, 2019a.
  • Simon S Du, Yuping Luo, Ruosong Wang, and Hanrui Zhang. Provably efficient q-learning with function approximation via distribution shift error checking oracle. In Advances in Neural Information Processing Systems, pages 8058–8068, 2019b.
  • Simon S. Du, Sham M. Kakade, Ruosong Wang, and Lin F. Yang. Is a good representation sufficient for sample efficient reinforcement learning? In International Conference on Learning Representations, 2020a. URL https://openreview.net/forum?id=r1genAVKPB.
  • Simon S Du, Jason D Lee, Gaurav Mahajan, and Ruosong Wang. Agnostic q-learning with function approximation in deterministic systems: Tight bounds on approximation error and sample complexity. arXiv preprint arXiv:2002.07125, 2020b.
  • Benjamin Eysenbach, Abhishek Gupta, Julian Ibarz, and Sergey Levine. Diversity is all you need: Learning skills without a reward function. arXiv preprint arXiv:1802.06070, 2018.
  • Carlos Florensa, David Held, Xinyang Geng, and Pieter Abbeel. Automatic goal generation for reinforcement learning agents. arXiv preprint arXiv:1705.06366, 2017.
  • Elad Hazan, Sham M Kakade, Karan Singh, and Abby Van Soest. Provably efficient maximum entropy exploration. arXiv preprint arXiv:1812.02690, 2018.
  • Rein Houthooft, Xi Chen, Yan Duan, John Schulman, Filip De Turck, and Pieter Abbeel. Vime: Variational information maximizing exploration. In Advances in Neural Information Processing Systems, pages 1109–1117, 2016.
  • Chi Jin, Zhuoran Yang, Zhaoran Wang, and Michael I Jordan. Provably efficient reinforcement learning with linear function approximation. arXiv preprint arXiv:1907.05388, 2019.
  • Chi Jin, Akshay Krishnamurthy, Max Simchowitz, and Tiancheng Yu. Reward-free exploration for reinforcement learning. In International Conference on Machine Learning, 2020.
  • William B Johnson and Joram Lindenstrauss. Extensions of lipschitz mappings into a hilbert space. Contemporary mathematics, 26(189-206):1, 1984.
  • Sobhan Miryoosefi, Kianté Brantley, Hal Daume III, Miro Dudik, and Robert E Schapire. Reinforcement learning with convex constraints. In Advances in Neural Information Processing Systems, pages 14070–14079, 2019.
  • Dipendra Misra, Mikael Henaff, Akshay Krishnamurthy, and John Langford. Kinematic state abstraction and provably efficient rich-observation reinforcement learning. arXiv preprint arXiv:1911.05815, 2019.
  • Rémi Munos and Csaba Szepesvári. Finite-time bounds for fitted value iteration. Journal of Machine Learning Research, 9(May):815–857, 2008.
  • Ashvin V Nair, Vitchyr Pong, Murtaza Dalal, Shikhar Bahl, Steven Lin, and Sergey Levine. Visual reinforcement learning with imagined goals. In Advances in Neural Information Processing Systems, pages 9191–9200, 2018.
  • Pierre-Yves Oudeyer, Frdric Kaplan, and Verena V Hafner. Intrinsic motivation systems for autonomous mental development. IEEE transactions on evolutionary computation, 11(2):265–286, 2007.
  • Deepak Pathak, Pulkit Agrawal, Alexei A Efros, and Trevor Darrell. Curiosity-driven exploration by self-supervised prediction. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, pages 16–17, 2017.
  • Vitchyr H Pong, Murtaza Dalal, Steven Lin, Ashvin Nair, Shikhar Bahl, and Sergey Levine. Skew-fit: State-covering self-supervised reinforcement learning. arXiv preprint arXiv:1903.03698, 2019.
  • Jürgen Schmidhuber. Formal theory of creativity, fun, and intrinsic motivation (1990–2010). IEEE Transactions on Autonomous Mental Development, 2(3):230–247, 2010.
  • Haoran Tang, Rein Houthooft, Davis Foote, Adam Stooke, OpenAI Xi Chen, Yan Duan, John Schulman, Filip DeTurck, and Pieter Abbeel. # exploration: A study of count-based exploration for deep reinforcement learning. In Advances in neural information processing systems, pages 2753–2762, 2017.
  • Chen Tessler, Daniel J Mankowitz, and Shie Mannor. Reward constrained policy optimization. arXiv preprint arXiv:1805.11074, 2018.
  • Ruosong Wang, Ruslan Salakhutdinov, and Lin F Yang. Provably efficient reinforcement learning with general value function approximation. arXiv preprint arXiv:2005.10804, 2020.
  • Zheng Wen and Benjamin Van Roy. Efficient exploration and value function generalization in deterministic systems. In Advances in Neural Information Processing Systems, pages 3021–3029, 2013.
  • Lin F. Yang and Mengdi Wang. Sample-optimal parametric q-learning using linearly additive features. In International Conference on Machine Learning, pages 6995–7004, 2019.
  • Andrew Chi-Chin Yao. Probabilistic computations: Toward a unified measure of complexity. In 18th Annual Symposium on Foundations of Computer Science (sfcs 1977), pages 222–227. IEEE, 1977.
  • Andrea Zanette, David Brandfonbrener, Matteo Pirotta, and Alessandro Lazaric. Frequentist regret bounds for randomized least-squares value iteration. arXiv preprint arXiv:1911.00567, 2019.
  • To prove Lemma 3.1, we need a concentration lemma similar to Lemma B.3 in [Jin et al., 2019].
  • Proof. The proof is nearly identical to that of Lemma B.3 in [Jin et al., 2019]. The only difference in our case is that we have different reward functions at different episodes. However, note that in our case r_h^k(·, ·) = u_h^k(·, ·)/H
  • ... for some Λ ∈ R^{d×d} and w ∈ R^d. Therefore, the value function shares exactly the same function class as that in Lemma D.6 in [Jin et al., 2019] (a sketch of this function class appears below). The rest of the proof follows similarly.
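
For context, the function class referred to here is presumably the standard one used for optimistic least-squares value iteration with linear features; the following is a sketch, not copied from the paper:

    % Sketch of the value function class (cf. Lemma D.6 of Jin et al. [2019]);
    % requires amsmath/amssymb.
    \[
      \mathcal{V} \;=\; \Big\{\, V(\cdot) \,=\, \min\Big\{ \max_{a}\; w^{\top}\phi(\cdot,a)
        \,+\, \beta \sqrt{\phi(\cdot,a)^{\top}\Lambda^{-1}\phi(\cdot,a)},\; H \Big\}
        \;:\; w \in \mathbb{R}^{d},\ \beta \ge 0,\ \Lambda \in \mathbb{R}^{d\times d},\ \Lambda \succeq \lambda I \,\Big\}.
    \]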