Tuning Continual Exploration in Reinforcement Learning

msra (2006)

Cited by 28 | Views 7
Abstract
This paper presents a model for tuning continual exploration in an optimal way by integrating exploration and exploitation in a common framework. It first quantifies exploration by defining the degree of exploration of a state as the entropy of the probability distribution for choosing an admissible action. The exploration/exploitation tradeoff is then formulated as a global optimization problem: find the exploration strategy that minimizes the expected cumulated cost while maintaining fixed degrees of exploration at the states. In other words, exploitation is maximized for constant exploration. This formulation leads to a set of nonlinear iterative equations reminiscent of the value-iteration algorithm, and their convergence to a local minimum can be proved for a stationary environment. Interestingly, in the deterministic case with no exploration, these equations reduce to the Bellman equations for finding the shortest path. If the graph of states is directed and acyclic, the nonlinear equations can easily be solved by a single backward pass from the destination state. Stochastic shortest-path problems and discounted problems are also examined, and the approach is compared to the SARSA algorithm. The theoretical results are confirmed by simple simulations showing that the proposed exploration strategy outperforms the ε-greedy and naive Boltzmann strategies.
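To make the abstract's idea concrete, the sketch below implements one plausible reading of it on a toy deterministic shortest-path graph: at every state, the action distribution is a Boltzmann distribution over cost-to-go values whose per-state temperature is tuned (here by bisection) so that its entropy matches a prescribed exploration level, and expected costs are updated with value-iteration-like sweeps. This is only a minimal illustration consistent with the abstract, not the authors' exact equations; the graph, the `exploration` parameter (entropy as a fraction of log of the number of actions), and all function names are assumptions for the example.

```python
import math

# Toy deterministic shortest-path problem (illustrative, not from the paper):
# edges[state] = list of (cost, next_state); "goal" is absorbing with value 0.
edges = {
    "s": [(1.0, "a"), (4.0, "b")],
    "a": [(2.0, "b"), (6.0, "goal")],
    "b": [(1.0, "goal")],
}
GOAL = "goal"

def boltzmann(q, theta):
    """Distribution proportional to exp(-theta * q): lower cost, higher probability."""
    m = min(q)
    w = [math.exp(-theta * (qi - m)) for qi in q]
    z = sum(w)
    return [wi / z for wi in w]

def entropy(p):
    return -sum(pi * math.log(pi) for pi in p if pi > 0.0)

def theta_for_entropy(q, target_h, hi=1e6, iters=80):
    """Bisect on the temperature so the Boltzmann policy over costs q has
    (approximately) the prescribed entropy; entropy decreases as theta grows
    whenever the costs are not all equal."""
    lo = 0.0
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if entropy(boltzmann(q, mid)) > target_h:
            lo = mid  # still too random: make the policy greedier
        else:
            hi = mid
    return 0.5 * (lo + hi)

def solve(edges, exploration=0.5, sweeps=200):
    """Value-iteration-like sweeps with a fixed degree of exploration per state."""
    V = {s: 0.0 for s in edges}
    V[GOAL] = 0.0
    policy = {}
    for _ in range(sweeps):
        for s, acts in edges.items():
            q = [c + V[nxt] for c, nxt in acts]
            # Target entropy: a fraction of the maximum entropy log(#actions).
            target_h = exploration * math.log(len(acts)) if len(acts) > 1 else 0.0
            if target_h > 0.0:
                p = boltzmann(q, theta_for_entropy(q, target_h))
            else:
                p = [1.0]  # single admissible action: no exploration possible
            policy[s] = list(zip([nxt for _, nxt in acts], p))
            V[s] = sum(pi * qi for pi, qi in zip(p, q))
    return V, policy

V, policy = solve(edges, exploration=0.5)
print(V)       # expected cumulated cost-to-go under the entropy-constrained policy
print(policy)  # per-state Boltzmann action probabilities
```

With `exploration=0.0` (and the degenerate branch replaced by a greedy argmin) the sweeps would reduce to ordinary Bellman shortest-path updates, which is the limiting case the abstract mentions.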
Keywords
nonlinear equation, cumulant, global optimization, reinforcement learning, Bellman equation, probability distribution, shortest path, value iteration