# On Reward-Free Reinforcement Learning with Linear Function Approximation

NeurIPS 2020.

Abstract:

Reward-free reinforcement learning (RL) is a framework which is suitable for both the batch RL setting and the setting where there are many reward functions of interest. During the exploration phase, an agent collects samples without using a pre-specified reward function. After the exploration phase, a reward function is given, and the agent uses samples collected during the exploration phase to compute a near-optimal policy for the given reward function.

Introduction

- In reinforcement learning (RL), an agent repeatedly interacts with an unknown environment to maximize the cumulative reward.
- The authors' algorithm, formally presented in Section 3, samples O(d^3 H^6 / ε^2) trajectories during the exploration phase, and with high probability outputs ε-optimal policies for an arbitrary number of reward functions satisfying Assumption 2.1 during the planning phase.
- The authors' second contribution is a hardness result for reward-free RL under the linear Q∗ assumption, which only requires the optimal value function to be a linear function of the given feature extractor and is weaker than the linear MDP assumption.
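
The two-phase structure (reward-free exploration, then planning for whatever reward is handed over) can be sketched concretely. The tiny deterministic chain MDP, the uniform-random exploration strategy, and all names below are illustrative stand-ins, not the paper's algorithm:

```python
import numpy as np

# Reward-free protocol sketch: (1) explore with no reward signal,
# (2) later, plan against any given reward function using only the
# collected data. The 3-state cyclic MDP and uniform-random
# exploration are toy assumptions for illustration.

rng = np.random.default_rng(0)
S, A, H = 3, 2, 4
P = np.zeros((S, A, S))                   # true dynamics, unknown to the agent
P[:, 0] = np.eye(S)                       # action 0: stay put
P[:, 1] = np.roll(np.eye(S), 1, axis=1)   # action 1: move to the next state

def explore(num_episodes):
    """Exploration phase: record transition counts; no reward involved."""
    counts = np.zeros((S, A, S))
    for _ in range(num_episodes):
        s = 0
        for _ in range(H):
            a = int(rng.integers(A))
            s_next = int(rng.choice(S, p=P[s, a]))
            counts[s, a, s_next] += 1
            s = s_next
    return counts

def plan(counts, r):
    """Planning phase: value iteration on the empirical model for reward r."""
    n = counts.sum(axis=2, keepdims=True)
    P_hat = np.where(n > 0, counts / np.maximum(n, 1), 1.0 / S)
    Q = np.zeros((H + 1, S, A))
    for h in range(H - 1, -1, -1):
        Q[h] = r[h] + P_hat @ Q[h + 1].max(axis=1)
    return Q

counts = explore(500)
r = np.zeros((H, S, A))
r[:, 2, :] = 1.0                          # reward function revealed only now
Q = plan(counts, r)
print(Q[0, 0].max())                      # estimated optimal value from state 0
```

The same `counts` can be reused in `plan` for any number of reward functions, which is the point of the reward-free framework: exploration cost is paid once.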

Highlights

- In reinforcement learning (RL), an agent repeatedly interacts with an unknown environment to maximize the cumulative reward
- In this work we study the reward-free RL setting, which was formalized in the recent work of Jin et al. [2020]
- In the planning phase, a specific reward function is given to the agent, and the goal is to use samples collected during the exploration phase to output a near-optimal policy for the given reward function
- Our second contribution is a hardness result for reward-free RL under the linear Q∗ assumption, which only requires the optimal value function to be a linear function of the given feature extractor and is weaker than the linear Markov decision process (MDP) assumption
- We show that there exists a class of MDPs which satisfies Assumption 2.2, such that any reward-free RL algorithm requires an exponential number of samples during the exploration phase in order to find a near-optimal policy during the planning phase
- During the planning phase of the algorithm, a 0.1-optimal policy is found with probability at most 0.6 < 0.9
- This paper provides both positive and negative results for reward-free RL with linear function approximation

Results

- The lower bound, formally presented in Section 4, shows that under the linear Q∗ assumption, any algorithm requires an exponential number of samples during the exploration phase in order for the agent to output a near-optimal policy during the planning phase with high probability.
- The authors' hardness result demonstrates that under the same assumption, any algorithm requires an exponential number of samples in the reward-free setting.
- The authors show that for deterministic systems, under the linear Q∗ assumption, there exists a polynomial sample complexity upper bound in the reward-free setting when the agent has sampling access to a generative model.
- For a specific set of reward functions r = {r_h}_{h=1}^H, given a policy π, a level h ∈ [H] and a state-action pair (s, a) ∈ S × A, the Q-function is defined as Q_h^π(s, a) = E[Σ_{h'=h}^H r_{h'}(s_{h'}, a_{h'}) | s_h = s, a_h = a], where the actions at levels h' > h are chosen according to π.
- To measure the performance of an algorithm, the authors define the sample complexity to be the number of episodes K required in the exploration phase to output an ε-optimal policy in the planning phase.
- After collecting O(d^3 H^6 log(d H δ^{-1} ε^{-1}) / ε^2) trajectories during the exploration phase, with probability 1 − δ, the algorithm outputs an ε-optimal policy for an arbitrary number of reward functions satisfying Assumption 2.1 during the planning phase.
- The authors show that there exists a class of MDPs which satisfies Assumption 2.2, such that any reward-free RL algorithm requires an exponential number of samples during the exploration phase in order to find a near-optimal policy during the planning phase.
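
The Q-function in the bullet above is computed by backward induction over h = H−1, …, 0. A minimal tabular illustration (the 2-state MDP, reward, and policy are invented for this sketch):

```python
import numpy as np

# Q^pi_h(s, a) = r_h(s, a) + E[ sum_{h'>h} r_{h'}(s_{h'}, pi(s_{h'})) ],
# computed by backward induction. The 2-state MDP, rewards, and policy
# below are made up for illustration.

S, A, H = 2, 2, 3
P = np.array([[[1.0, 0.0], [0.0, 1.0]],   # from state 0: a=0 stays, a=1 moves
              [[0.0, 1.0], [0.0, 1.0]]])  # state 1 is absorbing
r = np.zeros((H, S, A))
r[:, 1, :] = 1.0                          # reward 1 whenever the agent is in state 1
pi = np.array([1, 0])                     # policy: move from state 0, stay at state 1

Q = np.zeros((H + 1, S, A))
for h in range(H - 1, -1, -1):
    V_next = Q[h + 1][np.arange(S), pi]   # V^pi_{h+1}(s) = Q^pi_{h+1}(s, pi(s))
    Q[h] = r[h] + P @ V_next

print(Q[0, 0, 1])  # Q^pi_0(state 0, "move"): reach state 1 at h=1, collect 2 rewards
```

The ε-optimal policy criterion in the surrounding bullets compares the value of the returned policy, V^π_1(s_1) = Q^π_1(s_1, π(s_1)), to the optimal value, up to an additive ε.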

Conclusion

- There exists a class of deterministic systems that satisfies Assumption 2.2 with d = poly(H), such that any reward-free algorithm requires at least Ω(2^H) samples during the exploration phase in order to find a 0.1-optimal policy with probability at least 0.9 during the planning phase for a given set of reward functions r = {r_h}_{h=1}^H.
- Since there are 2^{H-2} state-action pairs (s, a) ∈ S_{H-2} × A and only one of them satisfies P_{H-2}(s, a) = s^+_{H-1}, and the algorithm samples at most 2^H/100 trajectories during the exploration phase, E holds with probability at least 0.9.
- An interesting future direction is to generalize the results to more general function classes using techniques in, e.g., [Wen and Van Roy, 2013, Ayoub et al., 2020, Wang et al., 2020].
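
The counting step in the bullets above can be sanity-checked numerically. The horizon value below is an arbitrary placeholder; the bound is the standard union-bound accounting over a uniformly random placement of the special pair:

```python
from fractions import Fraction

# Numeric check of the counting step: at level H-2 there are 2^(H-2)
# candidate state-action pairs, only one of which is special, and the
# algorithm collects at most 2^H/100 trajectories. Any fixed set of
# that many queries covers at most a 4/100 fraction of the candidates,
# so a uniformly random special pair is missed with probability
# at least 0.96 >= 0.9. H = 20 is a placeholder; any horizon works.
H = 20
num_pairs = 2 ** (H - 2)
num_queries = 2 ** H // 100
hit_prob = Fraction(num_queries, num_pairs)   # union bound over the queries
miss_prob = 1 - hit_prob
print(float(hit_prob), float(miss_prob))      # roughly 0.04 and 0.96
assert miss_prob >= Fraction(9, 10)
```

Exact rational arithmetic via `Fraction` avoids any floating-point slack in the comparison against 9/10.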

Related work

- Practitioners have proposed various exploration algorithms for RL without using explicit reward signals [Oudeyer et al., 2007, Schmidhuber, 2010, Bellemare et al., 2016, Houthooft et al., 2016, Florensa et al., 2017, Pathak et al., 2017, Tang et al., 2017, Achiam et al., 2017, Hazan et al., 2018, Burda et al., 2018, Colas et al., 2018, Co-Reyes et al., 2018, Nair et al., 2018, Eysenbach et al., 2018, Pong et al., 2019]. Theoretically, for the tabular case, while the reward-free setting was first formalized in Jin et al. [2020], algorithms in earlier works also guarantee to collect a polynomial-size dataset with coverage guarantees [Brafman and Tennenholtz, 2002, Hazan et al., 2018, Du et al., 2019a, Misra et al., 2019]. Jin et al. [2020] gave a new algorithm with O(|S|^2 |A| poly(H)/ε^2) sample complexity. They also provided a lower bound showing that the dependency of their algorithm on |S|, |A| and ε is optimal up to logarithmic factors. One of the questions asked in [Jin et al., 2020] is whether their result can be generalized to the function approximation setting.

This paper studies linear function approximation. A linear MDP is a setting where both the transition and the reward are linear functions of a given feature extractor. Recently, in the standard RL setting, many works [Yang and Wang, 2019, Jin et al., 2019, Cai et al., 2020, Zanette et al., 2019] have provided polynomial sample complexity guarantees for different algorithms in linear MDPs. Technically, our algorithm, which works in the reward-free setting, combines the algorithmic framework in [Jin et al., 2019] with a novel exploration-driven reward function (cf. Section 3). Linear Q∗ is another setting where only the optimal Q-function is assumed to be linear, which is weaker than the assumptions in the linear MDP setting. In the standard RL setting, it is an open problem whether one can use a polynomial number of samples to find a near-optimal policy in the linear Q∗ setting [Du et al., 2020a]. Existing upper bounds all require additional assumptions, such as (nearly) deterministic transitions [Wen and Van Roy, 2013, Du et al., 2019b, 2020b].
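
An exploration-driven reward of the kind used with the LSVI-UCB framework can be illustrated with the standard elliptical uncertainty bonus. The feature dimension, β, λ, and the synthetic data below are placeholders, and this is a sketch of the bonus computation, not the paper's exact construction:

```python
import numpy as np

# Sketch of an uncertainty-based exploration bonus:
#   bonus(phi) = beta * sqrt( phi^T Lambda^{-1} phi ),
# where Lambda = lam*I + sum_i phi_i phi_i^T is the regularized Gram
# matrix of features seen so far. d, beta, lam, and the synthetic
# features are illustrative placeholders.

rng = np.random.default_rng(1)
d, beta, lam = 4, 1.0, 1.0
observed = np.zeros((50, d))
observed[:, :2] = rng.normal(size=(50, 2))     # data spans only coordinates 0, 1
Lam = lam * np.eye(d) + observed.T @ observed  # regularized Gram matrix
Lam_inv = np.linalg.inv(Lam)

def bonus(phi):
    """Larger for feature directions the collected data has rarely covered."""
    return beta * float(np.sqrt(phi @ Lam_inv @ phi))

e0, e3 = np.eye(d)[0], np.eye(d)[3]
print(bonus(e0), bonus(e3))  # well-explored vs. unexplored direction
```

Directions already covered by the data receive a small bonus, while a never-visited direction receives a bonus near β/√λ; using such a bonus as the reward during exploration steers the agent toward under-visited parts of the feature space.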

Funding

- RW and RS are supported in part by NSF IIS1763562, AFRL CogDeCON FA875018C0014, and DARPA SAGAMORE HR00111990016
- SSD is supported by NSF grant DMS-1638352 and the Infosys Membership

References

- Yasin Abbasi-Yadkori, David Pal, and Csaba Szepesvari. Online-to-confidence-set conversions and application to sparse stochastic bandits. In Artificial Intelligence and Statistics, pages 1–9, 2012.
- Joshua Achiam, David Held, Aviv Tamar, and Pieter Abbeel. Constrained policy optimization. In Proceedings of the 34th International Conference on Machine Learning-Volume 70, pages 22–31. JMLR.org, 2017.
- Alekh Agarwal, Sham M Kakade, Jason D Lee, and Gaurav Mahajan. Optimality and approximation with policy gradient methods in Markov decision processes. arXiv preprint arXiv:1908.00261, 2019.
- Eitan Altman. Constrained Markov decision processes, volume 7. CRC Press, 1999.
- András Antos, Csaba Szepesvári, and Rémi Munos. Learning near-optimal policies with Bellman-residual minimization based fitted policy iteration and a single sample path. Machine Learning, 71(1):89–129, 2008.
- Alex Ayoub, Zeyu Jia, Csaba Szepesvari, Mengdi Wang, and Lin F. Yang. Model-based reinforcement learning with value-targeted regression. arXiv preprint arXiv:2006.01107, 2020.
- Marc Bellemare, Sriram Srinivasan, Georg Ostrovski, Tom Schaul, David Saxton, and Remi Munos. Unifying count-based exploration and intrinsic motivation. In Advances in neural information processing systems, pages 1471–1479, 2016.
- Dimitri P Bertsekas and John N Tsitsiklis. Neuro-dynamic programming, volume 5. Athena Scientific Belmont, MA, 1996.
- Ronen I Brafman and Moshe Tennenholtz. R-max - a general polynomial time algorithm for near-optimal reinforcement learning. Journal of Machine Learning Research, 3(Oct):213–231, 2002.
- Yuri Burda, Harrison Edwards, Amos Storkey, and Oleg Klimov. Exploration by random network distillation. arXiv preprint arXiv:1810.12894, 2018.
- Qi Cai, Zhuoran Yang, Chi Jin, and Zhaoran Wang. Provably efficient exploration in policy optimization. In International Conference on Machine Learning, 2020.
- Jinglin Chen and Nan Jiang. Information-theoretic considerations in batch reinforcement learning. arXiv preprint arXiv:1905.00360, 2019.
- John D Co-Reyes, YuXuan Liu, Abhishek Gupta, Benjamin Eysenbach, Pieter Abbeel, and Sergey Levine. Self-consistent trajectory autoencoder: Hierarchical reinforcement learning with trajectory embeddings. arXiv preprint arXiv:1806.02813, 2018.
- Cédric Colas, Pierre Fournier, Olivier Sigaud, and Pierre-Yves Oudeyer. Curious: Intrinsically motivated multi-task multi-goal reinforcement learning. 2018.
- Simon S Du, Akshay Krishnamurthy, Nan Jiang, Alekh Agarwal, Miroslav Dudík, and John Langford. Provably efficient rl with rich observations via latent state decoding. arXiv preprint arXiv:1901.09018, 2019a.
- Simon S Du, Yuping Luo, Ruosong Wang, and Hanrui Zhang. Provably efficient q-learning with function approximation via distribution shift error checking oracle. In Advances in Neural Information Processing Systems, pages 8058–8068, 2019b.
- Simon S. Du, Sham M. Kakade, Ruosong Wang, and Lin F. Yang. Is a good representation sufficient for sample efficient reinforcement learning? In International Conference on Learning Representations, 2020a. URL https://openreview.net/forum?id=r1genAVKPB.
- Simon S Du, Jason D Lee, Gaurav Mahajan, and Ruosong Wang. Agnostic q-learning with function approximation in deterministic systems: Tight bounds on approximation error and sample complexity. arXiv preprint arXiv:2002.07125, 2020b.
- Benjamin Eysenbach, Abhishek Gupta, Julian Ibarz, and Sergey Levine. Diversity is all you need: Learning skills without a reward function. arXiv preprint arXiv:1802.06070, 2018.
- Carlos Florensa, David Held, Xinyang Geng, and Pieter Abbeel. Automatic goal generation for reinforcement learning agents. arXiv preprint arXiv:1705.06366, 2017.
- Elad Hazan, Sham M Kakade, Karan Singh, and Abby Van Soest. Provably efficient maximum entropy exploration. arXiv preprint arXiv:1812.02690, 2018.
- Rein Houthooft, Xi Chen, Yan Duan, John Schulman, Filip De Turck, and Pieter Abbeel. Vime: Variational information maximizing exploration. In Advances in Neural Information Processing Systems, pages 1109–1117, 2016.
- Chi Jin, Zhuoran Yang, Zhaoran Wang, and Michael I Jordan. Provably efficient reinforcement learning with linear function approximation. arXiv preprint arXiv:1907.05388, 2019.
- Chi Jin, Akshay Krishnamurthy, Max Simchowitz, and Tiancheng Yu. Reward-free exploration for reinforcement learning. In International Conference on Machine Learning, 2020.
- William B Johnson and Joram Lindenstrauss. Extensions of Lipschitz mappings into a Hilbert space. Contemporary Mathematics, 26(189-206):1, 1984.
- Sobhan Miryoosefi, Kianté Brantley, Hal Daume III, Miro Dudik, and Robert E Schapire. Reinforcement learning with convex constraints. In Advances in Neural Information Processing Systems, pages 14070–14079, 2019.
- Dipendra Misra, Mikael Henaff, Akshay Krishnamurthy, and John Langford. Kinematic state abstraction and provably efficient rich-observation reinforcement learning. arXiv preprint arXiv:1911.05815, 2019.
- Rémi Munos and Csaba Szepesvári. Finite-time bounds for fitted value iteration. Journal of Machine Learning Research, 9(May):815–857, 2008.
- Ashvin V Nair, Vitchyr Pong, Murtaza Dalal, Shikhar Bahl, Steven Lin, and Sergey Levine. Visual reinforcement learning with imagined goals. In Advances in Neural Information Processing Systems, pages 9191–9200, 2018.
- Pierre-Yves Oudeyer, Frédéric Kaplan, and Verena V Hafner. Intrinsic motivation systems for autonomous mental development. IEEE Transactions on Evolutionary Computation, 11(2):265–286, 2007.
- Deepak Pathak, Pulkit Agrawal, Alexei A Efros, and Trevor Darrell. Curiosity-driven exploration by self-supervised prediction. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, pages 16–17, 2017.
- Vitchyr H Pong, Murtaza Dalal, Steven Lin, Ashvin Nair, Shikhar Bahl, and Sergey Levine. Skew-fit: State-covering self-supervised reinforcement learning. arXiv preprint arXiv:1903.03698, 2019.
- Jürgen Schmidhuber. Formal theory of creativity, fun, and intrinsic motivation (1990–2010). IEEE Transactions on Autonomous Mental Development, 2(3):230–247, 2010.
- Haoran Tang, Rein Houthooft, Davis Foote, Adam Stooke, OpenAI Xi Chen, Yan Duan, John Schulman, Filip DeTurck, and Pieter Abbeel. # exploration: A study of count-based exploration for deep reinforcement learning. In Advances in neural information processing systems, pages 2753–2762, 2017.
- Chen Tessler, Daniel J Mankowitz, and Shie Mannor. Reward constrained policy optimization. arXiv preprint arXiv:1805.11074, 2018.
- Ruosong Wang, Ruslan Salakhutdinov, and Lin F Yang. Provably efficient reinforcement learning with general value function approximation. arXiv preprint arXiv:2005.10804, 2020.
- Zheng Wen and Benjamin Van Roy. Efficient exploration and value function generalization in deterministic systems. In Advances in Neural Information Processing Systems, pages 3021–3029, 2013.
- Lin F. Yang and Mengdi Wang. Sample-optimal parametric q-learning using linearly additive features. In International Conference on Machine Learning, pages 6995–7004, 2019.
- Andrew Chi-Chin Yao. Probabilistic computations: Toward a unified measure of complexity. In 18th Annual Symposium on Foundations of Computer Science (sfcs 1977), pages 222–227. IEEE, 1977.
- Andrea Zanette, David Brandfonbrener, Matteo Pirotta, and Alessandro Lazaric. Frequentist regret bounds for randomized least-squares value iteration. arXiv preprint arXiv:1911.00567, 2019.
- To prove Lemma 3.1, we need a concentration lemma similar to Lemma B.3 in [Jin et al., 2019].
- Proof. The proof is nearly identical to that of Lemma B.3 in [Jin et al., 2019]. The only difference in our case is that we have different reward functions at different episodes. However, note that in our case r_h^k(·, ·) = u_h^k(·, ·)/H for some Λ ∈ R^{d×d} and w ∈ R^d. Therefore, the value function shares exactly the same function class as that in Lemma D.6 in [Jin et al., 2019]. The rest of the proof follows similarly.
