Optimism in Reinforcement Learning with Generalized Linear Function Approximation

ICLR, 2021.

TL;DR: A provably efficient (statistically and computationally) algorithm for reinforcement learning with generalized linear function approximation and no explicit dynamics assumptions.

Abstract:

We design a new provably efficient algorithm for episodic reinforcement learning with generalized linear function approximation. We analyze the algorithm under a new expressivity assumption that we call "optimistic closure," which is strictly weaker than assumptions from prior analyses for the linear setting. With optimistic closure, ...

Introduction
  • The authors study episodic reinforcement learning problems with infinitely large state spaces, where the agent must use function approximation to generalize across states while simultaneously engaging in strategic exploration.
  • With linear function approximation, Yang & Wang (2019) and Jin et al (2019) show that the optimism principle can yield provably sample-efficient algorithms when the environment dynamics satisfy certain linearity properties.
  • In Section 3 the authors study optimistic closure in detail and verify that it is strictly weaker than the recently proposed Linear MDP assumption (restated for reference below).
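    For concreteness, here is a standard statement of the Linear MDP assumption of Jin et al (2019), which optimistic closure is shown to strictly weaken. This restatement uses generic notation for illustration and is not quoted from the paper.

```latex
% Linear MDP (Jin et al, 2019): for a known feature map \phi : \mathcal{S} \times \mathcal{A} \to \mathbb{R}^d,
% there exist d unknown signed measures \mu = (\mu^{(1)}, \ldots, \mu^{(d)}) over \mathcal{S} and an
% unknown vector \theta \in \mathbb{R}^d such that, for all state-action pairs (s, a),
\[
  P(\cdot \mid s, a) = \big\langle \phi(s, a), \mu(\cdot) \big\rangle
  \qquad \text{and} \qquad
  r(s, a) = \big\langle \phi(s, a), \theta \big\rangle .
\]
% Optimistic closure places no such requirement on the dynamics themselves; roughly, it only asks
% that Bellman backups of the algorithm's optimistic Q-estimates stay inside the GLM function class.
```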
Highlights
  • We study episodic reinforcement learning problems with infinitely large state spaces, where the agent must use function approximation to generalize across states while simultaneously engaging in strategic exploration
  • This paper presents a provably efficient reinforcement learning algorithm that approximates the Q function with a generalized linear model (see the sketch after this list)
  • We prove that the algorithm obtains Õ(H√(d³T)) regret under mild regularity conditions and a new expressivity condition that we call optimistic closure
  • Using the fact that Corollary 3 applies beyond generalized linear models (GLMs), can we develop algorithms that can employ general function classes? While such algorithms do exist for the contextual bandit setting (Foster et al, 2018), it seems quite difficult to generalize this analysis to multi-step reinforcement learning
  • An important direction is to investigate weaker assumptions that enable provably efficient reinforcement learning with function approximation
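    As a rough illustration of the function class involved, the following sketch (hypothetical names, not the paper's code) shows a GLM Q-value, i.e., a known link function applied to a linear score of the state-action features, together with an optimistic estimate that adds an elliptical-norm exploration bonus and clips at the maximum return H.

```python
import numpy as np

def glm_q(phi, theta, link=lambda z: z):
    """GLM Q-value: a known link function applied to a linear score of features phi(s, a).

    With the identity link this reduces to ordinary linear function approximation.
    """
    return link(phi @ theta)

def optimistic_q(phi, theta, Lambda_inv, beta, H, link=lambda z: z):
    """Optimistic Q-estimate: GLM prediction plus a UCB-style elliptical bonus,
    clipped at the maximum possible return H. (Illustrative sketch, not the paper's code.)
    """
    bonus = beta * np.sqrt(phi @ Lambda_inv @ phi)
    return min(glm_q(phi, theta, link) + bonus, H)
```

    With the identity link this recovers the linear setting of Jin et al (2019); the paper allows a general link function, with regularity governed by constants such as K and κ mentioned in the Results section.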
Results
  • The algorithms developed here can accommodate function classes beyond generalized linear models, but they are still relatively impractical and the more practical ones require strong dynamics assumptions (Du et al, 2019b).
  • Both papers study MDPs with certain linear dynamics assumptions and use linear function approximation to obtain provably efficient algorithms.
  • Jin et al (2019) hint at optimistic closure as a weakening of their Linear MDP assumption and remark that their guarantees continue to hold under this weaker assumption.
  • Linear MDPs are studied by Jin et al (2019), who establish a √T-type regret bound for an optimistic algorithm.
  • The authors show that optimistic closure (Assumption 2) is strictly weaker than the linear MDP assumption from Jin et al (2019).
  • The algorithm uses dynamic programming to maintain optimistic Q function estimates Q_{h,t} for each time point h ≤ H and each episode t ≤ T (a schematic sketch follows this list).
  • The result states that LSVI-UCB enjoys √T-regret for any episodic MDP problem and any GLM, provided that the regularity conditions are satisfied and that optimistic closure holds.
  • These assumptions are relatively mild, encompassing the tabular setting and prior work on linear function approximation.
  • In the linear MDP setting of Jin et al (2019), the identity link function gives K = κ = 1 and M = 1, so Assumption 2 is guaranteed to hold.
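    A schematic of the backward dynamic-programming pass described above, specialized to the identity link for brevity. Helper names and the data layout are hypothetical; the paper's algorithm additionally handles the general GLM link and the precise choice of the bonus parameter β.

```python
import numpy as np

def lsvi_ucb_pass(data, H, d, beta, lam=1.0):
    """One backward value-iteration pass with ridge regression and optimism (identity link).

    data[h] is a nonempty list of transitions (phi, reward, next_phis), where phi is the
    d-dimensional feature of the visited (s_h, a_h) and next_phis lists the features of
    (s_{h+1}, a) for every action a.  Returns per-step parameters (theta, Lambda).
    Illustrative sketch only.
    """
    theta = [np.zeros(d) for _ in range(H + 1)]        # theta[H] stays 0: no value after the horizon
    Lambda = [lam * np.eye(d) for _ in range(H + 1)]
    for h in reversed(range(H)):
        Phi, y = [], []
        for phi, reward, next_phis in data[h]:
            if h == H - 1:
                q_next = 0.0                            # last step: no continuation value
            else:
                # Optimistic next-state value: max over actions of prediction + bonus, clipped at H.
                q_next = max(
                    min(p @ theta[h + 1]
                        + beta * np.sqrt(p @ np.linalg.solve(Lambda[h + 1], p)), H)
                    for p in next_phis
                )
            Phi.append(phi)
            y.append(reward + q_next)
        Phi, y = np.asarray(Phi), np.asarray(y)
        Lambda[h] = lam * np.eye(d) + Phi.T @ Phi         # regularized Gram matrix
        theta[h] = np.linalg.solve(Lambda[h], Phi.T @ y)  # ridge-regression solution
    return theta, Lambda
```

    At each episode the agent then acts greedily with respect to the resulting optimistic estimates, which is the standard optimism-in-the-face-of-uncertainty recipe.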
Conclusion
  • The authors' algorithm and analysis address problems with infinitely large state spaces and other settings that are significantly more complex than tabular MDPs, which the authors believe is more important than recovering the optimal guarantee for tabular MDPs. (The notation Õ(·) suppresses factors of M, K, κ, Γ and any logarithmic dependencies on the arguments.)
  • This paper presents a provably efficient reinforcement learning algorithm that approximates the Q function with a generalized linear model.
  • Further, these are the first statistically and computationally efficient algorithms for reinforcement learning with generalized linear function approximation that require no explicit dynamics assumptions.
Related work
  • The majority of the theoretical results for reinforcement learning focus on the tabular setting where the state space is finite and sample complexities scaling polynomially with |S| are tolerable (Kearns & Singh, 2002; Brafman & Tennenholtz, 2002; Strehl et al, 2006). Indeed, by now there are a number of algorithms that achieve strong guarantees in this setting (Dann et al, 2017; Azar et al, 2017; Jin et al, 2018; Simchowitz & Jamieson, 2019). Via Fact 2, our results apply to this setting, and indeed our algorithm can be viewed as a generalization of the canonical tabular algorithm (Azar et al, 2017; Dann et al, 2017; Simchowitz & Jamieson, 2019) to the function approximation setting. (Related results appear elsewhere in the literature focusing on the tabular setting, see e.g., Simchowitz & Jamieson, 2019.)

    Turning to the function approximation setting, several other results concern function approximation in settings where exploration is not an issue, including the infinite-data regime (Munos, 2003; Farahmand et al, 2010) and the “batch RL” setting where the agent does not control the data-collection process (Munos & Szepesvari, 2008; Antos et al, 2008; Chen & Jiang, 2019). While the details differ, all of these results require that the function class satisfy some form of (approximate) closure with respect to the Bellman operator. As an example, one assumption is that T(g) ∈ G for all g ∈ G, with an appropriately defined approximate variant (Chen & Jiang, 2019). These results therefore provide motivation for our optimistic closure assumption. While optimistic closure is stronger than the assumptions in these works, we emphasize that we are also addressing exploration, so our setting is also significantly more challenging.
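    To make the closure conditions in the preceding paragraph concrete, the Bellman optimality backup and the two notions of closure can be written as follows (the optimistic-closure line is an informal paraphrase of Assumption 2, not a quotation).

```latex
% Bellman optimality backup (standard definition):
\[
  (\mathcal{T} Q)(s, a) \;=\; r(s, a) \;+\; \mathbb{E}_{s' \sim P(\cdot \mid s, a)}\Big[\, \max_{a'} Q(s', a') \,\Big].
\]
% Batch-RL closure (Chen & Jiang, 2019, with an approximate variant):
%   \mathcal{T} g \in \mathcal{G} \quad \text{for all } g \in \mathcal{G}.
% Optimistic closure (informal): \mathcal{T} Q lies in the GLM class for every optimistic estimate Q
% of the form ``clipped GLM prediction plus elliptical bonus''.
```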
Reference
  • Yasin Abbasi-Yadkori, David Pal, and Csaba Szepesvari. Improved algorithms for linear stochastic bandits. In Advances in Neural Information Processing Systems, 2011.
  • Yasin Abbasi-Yadkori, David Pal, and Csaba Szepesvari. Online-to-confidence-set conversions and application to sparse stochastic bandits. In International Conference on Artificial Intelligence and Statistics, 2012.
  • Alekh Agarwal, Mikael Henaff, Sham Kakade, and Wen Sun. PC-PG: Policy cover directed exploration for provable policy gradient learning. arXiv preprint arXiv:2007.08459, 2020.
  • Andras Antos, Csaba Szepesvari, and Remi Munos. Learning near-optimal policies with Bellman-residual minimization based fitted policy iteration and a single sample path. Machine Learning, 2008.
  • Alex Ayoub, Zeyu Jia, Csaba Szepesvari, Mengdi Wang, and Lin F Yang. Model-based reinforcement learning with value-targeted regression. arXiv preprint arXiv:2006.01107, 2020.
  • Mohammad Gheshlaghi Azar, Ian Osband, and Remi Munos. Minimax regret bounds for reinforcement learning. In International Conference on Machine Learning, 2017.
  • Marc Bellemare, Sriram Srinivasan, Georg Ostrovski, Tom Schaul, David Saxton, and Remi Munos. Unifying count-based exploration and intrinsic motivation. In Advances in Neural Information Processing Systems, 2016.
  • Steven J Bradtke and Andrew G Barto. Linear least-squares algorithms for temporal difference learning. Machine Learning, 1996.
  • Ronen I. Brafman and Moshe Tennenholtz. R-MAX: A general polynomial time algorithm for near-optimal reinforcement learning. The Journal of Machine Learning Research, 2002.
  • Qi Cai, Zhuoran Yang, Chi Jin, and Zhaoran Wang. Provably efficient exploration in policy optimization. arXiv preprint arXiv:1912.05830, 2019.
  • Jinglin Chen and Nan Jiang. Information-theoretic considerations in batch reinforcement learning. In International Conference on Machine Learning, 2019.
  • Christoph Dann, Tor Lattimore, and Emma Brunskill. Unifying PAC and regret: Uniform PAC bounds for episodic reinforcement learning. In Advances in Neural Information Processing Systems, 2017.
  • Simon S Du, Sham M Kakade, Ruosong Wang, and Lin F Yang. Is a good representation sufficient for sample efficient reinforcement learning? arXiv preprint arXiv:1910.03016, 2019a.
  • Simon S Du, Akshay Krishnamurthy, Nan Jiang, Alekh Agarwal, Miroslav Dudík, and John Langford. Provably efficient RL with rich observations via latent state decoding. In International Conference on Machine Learning, 2019b.
  • Amir-massoud Farahmand, Csaba Szepesvari, and Remi Munos. Error propagation for approximate policy and value iteration. In Advances in Neural Information Processing Systems, 2010.
  • Sarah Filippi, Olivier Cappe, Aurelien Garivier, and Csaba Szepesvari. Parametric bandits: The generalized linear case. In Advances in Neural Information Processing Systems, 2010.
  • Dylan J Foster, Alekh Agarwal, Miroslav Dudík, Haipeng Luo, and Robert E Schapire. Practical contextual bandits with regression oracles. In International Conference on Machine Learning, 2018.
  • Scott Fujimoto, David Meger, and Doina Precup. Off-policy deep reinforcement learning without exploration. arXiv preprint arXiv:1812.02900, 2018.
  • Nan Jiang, Akshay Krishnamurthy, Alekh Agarwal, John Langford, and Robert E Schapire. Contextual decision processes with low Bellman rank are PAC-learnable. In International Conference on Machine Learning, 2017.
  • Chi Jin, Zeyuan Allen-Zhu, Sebastien Bubeck, and Michael I Jordan. Is Q-learning provably efficient? In Advances in Neural Information Processing Systems, 2018.
  • Chi Jin, Zhuoran Yang, Zhaoran Wang, and Michael I Jordan. Provably efficient reinforcement learning with linear function approximation. arXiv preprint arXiv:1907.05388, 2019.
  • Michael Kearns and Satinder Singh. Near-optimal reinforcement learning in polynomial time. Machine Learning, 2002.
  • Akshay Krishnamurthy, Alekh Agarwal, and John Langford. PAC reinforcement learning with rich observations. In Advances in Neural Information Processing Systems, 2016.
  • Lihong Li, Yu Lu, and Dengyong Zhou. Provably optimal algorithms for generalized linear contextual bandits. In International Conference on Machine Learning, 2017.
  • Francisco S Melo and M Isabel Ribeiro. Q-learning with linear function approximation. In Conference on Learning Theory, 2007.
  • Aditya Modi, Nan Jiang, Ambuj Tewari, and Satinder Singh. Sample complexity of reinforcement learning using linearly combined model ensembles. In International Conference on Artificial Intelligence and Statistics, pp. 2010–2020, 2020.
  • Remi Munos. Error bounds for approximate policy iteration. In International Conference on Machine Learning, 2003.
  • Remi Munos and Csaba Szepesvari. Finite-time bounds for fitted value iteration. The Journal of Machine Learning Research, 2008.
  • Ian Osband, Charles Blundell, Alexander Pritzel, and Benjamin Van Roy. Deep exploration via bootstrapped DQN. In Advances in Neural Information Processing Systems, 2016.
  • Victor H Pena, Tze Leung Lai, and Qi-Man Shao. Self-Normalized Processes: Limit Theory and Statistical Applications. Springer Science & Business Media, 2008.
  • Max Simchowitz and Kevin Jamieson. Non-asymptotic gap-dependent regret bounds for tabular MDPs. arXiv preprint arXiv:1905.03814, 2019.
  • Alexander L. Strehl, Lihong Li, Eric Wiewiora, John Langford, and Michael L. Littman. PAC model-free reinforcement learning. In International Conference on Machine Learning, 2006.
  • Haoran Tang, Rein Houthooft, Davis Foote, Adam Stooke, OpenAI Xi Chen, Yan Duan, John Schulman, Filip De Turck, and Pieter Abbeel. #Exploration: A study of count-based exploration for deep reinforcement learning. In Advances in Neural Information Processing Systems, 2017.
  • Lin F Yang and Mengdi Wang. Reinforcement learning in feature space: Matrix bandit, kernels, and regret bound. arXiv preprint arXiv:1905.10389, 2019.
  • Andrea Zanette, David Brandfonbrener, Matteo Pirotta, and Alessandro Lazaric. Frequentist regret bounds for randomized least-squares value iteration. arXiv preprint arXiv:1911.00567, 2019.
  • Andrea Zanette, Alessandro Lazaric, Mykel Kochenderfer, and Emma Brunskill. Learning near optimal policies with low inherent Bellman error. arXiv preprint arXiv:2003.00153, 2020a.
  • Andrea Zanette, Alessandro Lazaric, Mykel J Kochenderfer, and Emma Brunskill. Provably efficient reward-agnostic navigation with linear value iteration. arXiv preprint arXiv:2008.07737, 2020b.
  • Dongruo Zhou, Jiafan He, and Quanquan Gu. Provably efficient reinforcement learning for discounted MDPs with feature mapping. arXiv preprint arXiv:2006.13165, 2020.