Provable and Practical: Efficient Exploration in Reinforcement Learning via Langevin Monte Carlo
arXiv (2023)
Abstract
We present a scalable and effective exploration strategy based on Thompson
sampling for reinforcement learning (RL). A key shortcoming of existing
Thompson sampling algorithms is the need for a Gaussian approximation of the
posterior distribution, which is a poor surrogate in most practical settings.
Instead, we sample the Q function directly from its posterior distribution
using Langevin Monte Carlo, an efficient type of Markov chain Monte Carlo
(MCMC) method. Our method needs only noisy gradient descent updates to learn
the exact posterior distribution of the Q function, which makes it easy to
deploy in deep RL. We provide a rigorous theoretical analysis of the proposed
method and show that, in the linear Markov decision process (linear MDP)
setting, it achieves a regret bound of Õ(d^{3/2} H^{3/2} √T), where d is the
dimension of the feature mapping, H is the planning horizon, and T is the
total number of steps. We apply this approach to deep RL by using the Adam
optimizer to perform gradient updates. On several challenging exploration
tasks from the Atari57 suite, our approach achieves results better than or
comparable to those of state-of-the-art deep RL algorithms.
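
To make the sampling step concrete, below is a minimal sketch (not the paper's implementation) of the Langevin Monte Carlo update the abstract describes: plain gradient descent plus appropriately scaled Gaussian noise, applied here to the weights of a linear Q-function. The names `phi`, `targets`, and `lam`, and all hyperparameter values, are illustrative assumptions, not taken from the paper.

```python
import numpy as np

def lmc_update(theta, grad_loss, step_size=1e-4, inverse_temp=1.0, rng=None):
    """One Langevin Monte Carlo step: a noisy gradient descent update.

    Iterating this update yields (approximate) samples from the posterior
    proportional to exp(-inverse_temp * loss), rather than a Gaussian
    approximation of it.
    """
    rng = np.random.default_rng() if rng is None else rng
    noise = rng.standard_normal(theta.shape)
    return (theta
            - step_size * grad_loss(theta)
            + np.sqrt(2.0 * step_size / inverse_temp) * noise)

# Hypothetical usage for a linear MDP, where Q(s, a) = phi(s, a) @ theta and
# the loss is a ridge-regularized least-squares regression onto TD targets.
d = 8                                   # feature dimension
phi = np.random.randn(100, d)           # features of visited (s, a) pairs
targets = np.random.randn(100)          # targets r + max_a' Q(s', a') (placeholder data)
lam = 1.0                               # ridge regularization strength

def grad_loss(theta):
    # Gradient of 0.5 * ||phi @ theta - targets||^2 + 0.5 * lam * ||theta||^2
    return phi.T @ (phi @ theta - targets) + lam * theta

theta = np.zeros(d)
for _ in range(1000):                   # burn-in; the final theta is one posterior sample
    theta = lmc_update(theta, grad_loss)
```

The deep RL variant in the paper replaces this plain noisy gradient step with Adam-based updates; the sketch above only illustrates the basic update rule in the linear setting.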