# Provably Efficient Exploration for Reinforcement Learning Using Unsupervised Learning

NeurIPS 2020.

Abstract:

We study how to use unsupervised learning for efficient exploration in reinforcement learning with rich observations generated from a small number of latent states. We present a novel algorithmic framework that is built upon two components: an unsupervised learning algorithm and a no-regret reinforcement learning algorithm. We show that...

Introduction

- Reinforcement learning (RL) is the framework of learning to control an unknown system through trial and error.
- It takes as inputs the observations of the environment and outputs a policy, i.e., a mapping from observations to actions, to maximize the cumulative rewards.
- A function approximation scheme is adopted so that quantities essential for policy improvement, e.g. state-action values, can be generalized from limited observed data to the whole observation space.
- The use of function approximation alone does not resolve the exploration problem (Du et al, 2019b).

Highlights

- Modern RL applications often need to deal with huge observation spaces, such as those consisting of images or text, which makes it challenging or impossible to fully explore the environment in a direct way.
- We develop a novel framework for RL problems with rich observations that are generated from a small number of latent states.
- We prove that as long as the unsupervised learning oracle and the tabular RL algorithm each have a polynomial sample complexity guarantee, our framework finds a near-optimal policy with sample complexity polynomial in the number of latent states, which is significantly smaller than the number of possible observations.
- The title of each subfigure records the length of the horizon, the switch parameter α in actions, and the unsupervised learning method we apply for URL.
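
The two-component framework in the highlights above can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration and is not taken from the paper: the toy `RichObsChain` environment, the hand-written `decode` function standing in for the unsupervised learning oracle, and the greedy Q-learning update with optimistic initialization standing in for the no-regret tabular RL algorithm. The point it shows is only the structure: the tabular learner is indexed by decoded latent labels, never by raw observations.

```python
import random
from collections import defaultdict


class RichObsChain:
    """Hypothetical toy environment: 3 latent states, 2 actions.

    Action 1 moves one state to the right, action 0 resets to state 0.
    Reward 1 is paid only if the episode ends in state 2.
    Observations are the latent state id followed by irrelevant random bits.
    """

    def __init__(self, horizon=2, noise_bits=6, seed=0):
        self.horizon = horizon
        self.noise_bits = noise_bits
        self.rng = random.Random(seed)

    def _emit(self):
        noise = tuple(self.rng.randint(0, 1) for _ in range(self.noise_bits))
        return (self.state,) + noise

    def reset(self):
        self.state, self.t = 0, 0
        return self._emit()

    def step(self, action):
        self.state = min(self.state + 1, 2) if action == 1 else 0
        self.t += 1
        done = self.t == self.horizon
        reward = 1.0 if (done and self.state == 2) else 0.0
        return self._emit(), reward, done


def decode(obs):
    """Stand-in for a learned decoder (assumption): reads off the latent label."""
    return obs[0]


def run_framework(env, decode, episodes=20, actions=(0, 1)):
    """Run a tabular learner on decoded latent labels instead of observations."""
    # Optimistic initialization: 1.0 is the largest possible return here.
    q = defaultdict(lambda: [1.0 for _ in actions])
    returns = []
    for _ in range(episodes):
        obs, t, done, total = env.reset(), 0, False, 0.0
        while not done:
            z = decode(obs)  # the learner only ever sees the decoded label
            a = max(actions, key=lambda b: q[(t, z)][b])
            obs, r, done = env.step(a)
            nxt = 0.0 if done else max(q[(t + 1, decode(obs))])
            q[(t, z)][a] = r + nxt  # exact backup; the toy dynamics are deterministic
            total += r
            t += 1
        returns.append(total)
    return q, returns


q_table, episode_returns = run_framework(RichObsChain(), decode)
print(len(q_table), episode_returns[-1])
```

In this toy run the Q-table holds only a handful of (step, latent label) entries, while the number of distinct observations grows exponentially with `noise_bits`.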

Results

- The results are shown in Figures 1, 2, and 3, where the x-axis is the number of episodes run and the y-axis is the average reward per episode.
- In LockBernoulli, OracleQ-obs and QLearning-obs are far from optimal even for small horizons.
- URL outperforms PCID in most cases.
- When H = 20, the authors observe that URL returns near-optimal values with probability 80% for α = 0.2 and 0.5.
- In LockGaussian, OracleQ-obs and QLearning-obs are omitted because there are infinitely many observations.
- URL again outperforms PCID in most cases.
- For H = 20, the authors observe that URL returns a near-optimal policy with probability above 75% for α = 0.2 and 0.5.

Conclusion

- The current paper gave a general framework that turns an unsupervised learning algorithm and a no-regret tabular RL algorithm into an algorithm for RL problems with huge observation spaces.
- The authors provided a theoretical analysis showing that the framework is provably efficient.
- The authors conducted numerical experiments to show the effectiveness of the framework in practice.
- This result complements empirical findings that unsupervised learning can guide exploration.
- An interesting future theoretical direction is to characterize the optimal sample complexity under the assumptions.

Summary

## Objectives:

- This approach has not been theoretically justified.
- The authors aim to answer this question:
- At episode k of A, the goal is to simulate a trajectory of πk running on the underlying

Related work

In this section, we review related provably efficient RL algorithms. We focus on environments that require explicit exploration. Under certain assumptions on the environment, e.g., the existence of a good exploration policy or a sufficiently diverse initial-state distribution, one does not need to explore explicitly (Munos, 2005; Antos et al, 2008; Geist et al, 2019; Kakade and Langford, 2002; Bagnell et al, 2004; Scherrer and Geist, 2014; Agarwal et al, 2019; Yang et al, 2019b; Chen and Jiang, 2019). Without these assumptions, the problem can require an exponential number of samples, especially for policy-based methods (Du et al, 2019b).

Exploration is needed even in the most basic tabular setting. There is a substantial body of work on provably efficient tabular RL (Agrawal and Jia, 2017; Jaksch et al, 2010; Kakade et al, 2018; Azar et al, 2017; Kearns and Singh, 2002; Dann et al, 2017; Strehl et al, 2006; Jin et al, 2018; Simchowitz and Jamieson, 2019; Zanette and Brunskill, 2019). A common strategy is to use a UCB bonus to encourage exploration of less-visited states and actions. One can also study RL in metric spaces (Pazis and Parr, 2013; Song and Sun, 2019; Yang et al, 2019a). However, algorithms of this type in general have an exponential dependence on the state dimension.
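
The UCB-bonus idea mentioned above can be sketched in a few lines. This is a generic count-based bonus of the form scale / sqrt(n), chosen here for illustration; the cited papers use their own specific bonus terms and constants, and the names `CountBonus` and `optimistic_action` are hypothetical.

```python
import math
from collections import Counter


class CountBonus:
    """Generic visit-count exploration bonus (illustrative, not a cited algorithm)."""

    def __init__(self, scale=1.0):
        self.scale = scale
        self.counts = Counter()

    def visit(self, state, action):
        self.counts[(state, action)] += 1

    def bonus(self, state, action):
        # Unvisited pairs are treated as visited once to avoid division by zero.
        n = self.counts[(state, action)]
        return self.scale / math.sqrt(max(n, 1))


def optimistic_action(q_values, bonus, state, actions):
    """Pick the action maximizing estimated value plus exploration bonus."""
    return max(
        actions,
        key=lambda a: q_values.get((state, a), 0.0) + bonus.bonus(state, a),
    )


b = CountBonus()
for _ in range(4):
    b.visit(0, 0)  # action 0 has been tried four times in state 0

q = {(0, 0): 0.3, (0, 1): 0.0}
print(b.bonus(0, 0), b.bonus(0, 1), optimistic_action(q, b, 0, [0, 1]))
```

Here action 1 is selected even though its value estimate is lower, because its bonus is larger: it has been visited less often.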

Funding

- Fei Feng and Wotao Yin were supported by AFOSR MURI FA9550-18-10502, NSF DMS-1720237, and ONR N0001417121
- Du is supported by NSF grant DMS-1638352 and the Infosys Membership

Study subjects and analysis

The good actions are randomly assigned for every state. We consider three cases: α = 0, α = 0.2, and α = 0.5. In LockBernoulli, the observation space is {0, 1}^{H+3}, where the first 3 coordinates are reserved for the one-hot encoding of the latent state and the last H coordinates are drawn i.i.d. from Bernoulli(0.5).
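
The LockBernoulli observation layout described above can be sketched as follows. The function names and the constant `NUM_LATENT` are hypothetical, and `decode_observation` is a ground-truth decoder written by hand for illustration; in the paper's setting a decoder would have to be learned without supervision.

```python
import random

NUM_LATENT = 3  # the first 3 coordinates hold the one-hot latent state


def emit_observation(state, horizon, rng):
    """Build an observation in {0, 1}^(H + 3) for the given latent state."""
    one_hot = [1 if i == state else 0 for i in range(NUM_LATENT)]
    # randint(0, 1) draws each noise coordinate from Bernoulli(0.5).
    noise = [rng.randint(0, 1) for _ in range(horizon)]
    return one_hot + noise


def decode_observation(obs):
    """Recover the latent state from the one-hot prefix, ignoring the noise."""
    return obs[:NUM_LATENT].index(1)


rng = random.Random(0)
obs = emit_observation(state=2, horizon=5, rng=rng)
print(len(obs), obs[:NUM_LATENT], decode_observation(obs))
```

With horizon H there are 3 · 2^H possible observations but only 3 latent states, which is the gap the framework is meant to exploit.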

Reference

- Achlioptas, D. and McSherry, F. (2005). On spectral learning of mixtures of distributions. In International Conference on Computational Learning Theory, pages 458–469. Springer.
- Agarwal, A., Kakade, S. M., Lee, J. D., and Mahajan, G. (2019). Optimality and approximation with policy gradient methods in Markov decision processes. arXiv preprint arXiv:1908.00261.
- Azar, M. G., Osband, I., and Munos, R. (2017). Minimax regret bounds for reinforcement learning. In Proceedings of the 34th International Conference on Machine Learning-Volume 70, pages 263–272. JMLR.org.
- Azizzadenesheli, K., Brunskill, E., and Anandkumar, A. (2018). Efficient exploration through Bayesian deep Q-networks. In 2018 Information Theory and Applications Workshop (ITA), pages 1–9.
- Bagnell, J. A., Kakade, S. M., Schneider, J. G., and Ng, A. Y. (2004). Policy search by dynamic programming. In Advances in Neural Information Processing Systems, pages 831–838.
- Bellemare, M., Srinivasan, S., Ostrovski, G., Schaul, T., Saxton, D., and Munos, R. (2016). Unifying count-based exploration and intrinsic motivation. In Advances in Neural Information Processing Systems, pages 1471–1479.
