Adaptive Reward-Free Exploration

ALT 2021

Abstract
Reward-free exploration is a reinforcement learning setting recently studied by Jin et al., who address it by running several algorithms with regret guarantees in parallel. In our work, we instead propose a more adaptive approach for reward-free exploration which directly reduces upper bounds on the maximum MDP estimation error. We show that, interestingly, our reward-free UCRL algorithm can be seen as a variant of an algorithm of Fiechter from 1994, originally proposed for a different objective that we call best-policy identification. We prove that RF-UCRL needs $\mathcal{O}\left(({SAH^4}/{\varepsilon^2})\ln(1/\delta)\right)$ episodes to output, with probability $1-\delta$, an $\varepsilon$-approximation of the optimal policy for any reward function. We empirically compare it to oracle strategies using a generative model.
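For a concrete picture of the exploration loop described above, the sketch below acts greedily with respect to an upper bound W on the MDP estimation error, computed by backward induction from empirical transition counts, and stops once the bound at the initial state falls below a target accuracy. This is a minimal illustration under stated assumptions, not the paper's exact algorithm: the confidence width `beta`, the `epsilon/2` stopping rule, and the `env_reset`/`env_step` interface are hypothetical choices for this sketch.

```python
# Minimal sketch of error-upper-bound-driven reward-free exploration.
# Assumptions (not from the paper): the form of beta(), the epsilon/2
# stopping rule, and the env_reset/env_step callables.
import numpy as np

def rf_exploration_sketch(env_reset, env_step, S, A, H, epsilon, delta,
                          max_episodes=10_000):
    counts = np.zeros((S, A), dtype=int)            # n_t(s, a)
    trans_counts = np.zeros((S, A, S), dtype=int)   # n_t(s, a, s')

    def beta(n):
        # Hypothetical confidence width; the paper uses a refined threshold.
        return np.log(2 * S * A * H / delta) + S * np.log(np.e * (1 + n))

    for t in range(max_episodes):
        # Empirical transition kernel (uniform where a pair is unvisited).
        p_hat = np.where(counts[..., None] > 0,
                         trans_counts / np.maximum(counts[..., None], 1),
                         1.0 / S)

        # Backward induction for the error upper bound W^h(s, a).
        W = np.zeros((H + 1, S, A))
        for h in range(H - 1, -1, -1):
            bonus = H * np.sqrt(2 * beta(counts) / np.maximum(counts, 1))
            bonus[counts == 0] = H
            # sum_{s'} p_hat(s'|s,a) * max_b W^{h+1}(s', b)
            next_val = p_hat @ W[h + 1].max(axis=1)
            W[h] = np.minimum(H, bonus + next_val)

        s = env_reset()
        if W[0, s].max() <= epsilon / 2:            # assumed stopping rule
            return counts, trans_counts, t

        # Roll out one episode greedily w.r.t. the error bound W.
        for h in range(H):
            a = int(np.argmax(W[h, s]))
            s_next = env_step(s, a)
            counts[s, a] += 1
            trans_counts[s, a, s_next] += 1
            s = s_next

    return counts, trans_counts, max_episodes
```

Any toy tabular environment exposed through `env_reset` / `env_step` (e.g. a short deterministic chain) is enough to exercise the loop; the returned counts define the empirical model from which a near-optimal policy for any reward function can then be planned.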
Keywords
exploration, reward-free