Tuning Continual Exploration in Reinforcement Learning (Draft manuscript submitted for publication)

Abstract
This paper presents a model for tuning continual exploration in an optimal way by integrating exploration and exploitation in a common framework. It first quantifies the rate of exploration by defining the degree of exploration of a state as the entropy of the probability distribution for choosing an admissible action. The exploration/exploitation tradeoff is then stated as a global optimization problem: find the exploration strategy that minimizes the expected cumulated cost while maintaining fixed degrees of exploration at the nodes. In other words, "exploitation" is maximized for constant "exploration". This formulation leads to a set of nonlinear updating rules reminiscent of the value-iteration algorithm. Convergence of these rules to a local minimum can be proved for a stationary environment. Interestingly, in the deterministic case, when there is no exploration, these equations reduce to the Bellman equations for finding the shortest path, while, when exploration is maximal, a full "blind" exploration is performed. We also show that, if the graph of states is directed and acyclic, the nonlinear equations can easily be solved by performing a single backward pass from the destination state. Stochastic shortest-path problems as well as discounted problems are also examined. The theoretical results are confirmed by simple simulations showing that this exploration strategy outperforms the naive ε-greedy and Boltzmann strategies.
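To make the formulation concrete: the degree of exploration of a state k is the entropy E_k = -Σ_i π_k(i) ln π_k(i) of its action-choice distribution, and the goal is to minimize the expected cumulated cost while keeping each E_k fixed. The Python sketch below illustrates one plausible form such entropy-constrained, value-iteration-like updates can take; it is not the paper's reference implementation. The graph, costs, target entropy, bisection bounds, and all function names are hypothetical, and the choice of a Boltzmann policy whose parameter theta is tuned to hit the prescribed entropy is one natural instantiation rather than the paper's exact rule.

import math

# Illustrative sketch only (hypothetical graph, costs, and names; not the paper's implementation).
# Small directed acyclic graph: state -> list of (immediate cost, successor); "goal" is the destination.
successors = {
    "s0": [(1.0, "s1"), (4.0, "s2")],
    "s1": [(2.0, "goal"), (2.0, "s2")],
    "s2": [(1.0, "goal")],
    "goal": [],
}

def boltzmann(values, theta):
    """Boltzmann distribution over action values (lower expected cost -> higher probability)."""
    v_min = min(values)
    weights = [math.exp(-theta * (v - v_min)) for v in values]  # shifted for numerical stability
    total = sum(weights)
    return [w / total for w in weights]

def entropy(probs):
    return -sum(p * math.log(p) for p in probs if p > 0.0)

def policy_with_entropy(values, target_entropy, lo=1e-6, hi=50.0, steps=60):
    """Bisect on theta so the Boltzmann policy's entropy approximates the prescribed degree of exploration."""
    if len(values) == 1:
        return [1.0]
    target = min(target_entropy, math.log(len(values)))  # entropy cannot exceed log of the number of actions
    for _ in range(steps):
        mid = 0.5 * (lo + hi)
        if entropy(boltzmann(values, mid)) > target:
            lo = mid   # entropy still too high: increase theta (less exploration)
        else:
            hi = mid
    return boltzmann(values, 0.5 * (lo + hi))

def solve(successors, target_entropy=0.3, sweeps=200):
    """Value-iteration-like sweeps: expected cost-to-go under the entropy-constrained policy."""
    V = {s: 0.0 for s in successors}  # V("goal") stays 0
    for _ in range(sweeps):
        for s, actions in successors.items():
            if not actions:
                continue
            q = [cost + V[nxt] for cost, nxt in actions]     # action values (expected cost-to-go)
            pi = policy_with_entropy(q, target_entropy)      # exploration-constrained policy
            V[s] = sum(p * qv for p, qv in zip(pi, q))       # expected cost under pi
    return V

print(solve(successors, target_entropy=0.3))
print(solve(successors, target_entropy=1e-4))  # near-zero exploration: approaches shortest-path costs

Driving the target entropy toward zero makes each policy nearly greedy, so the computed costs-to-go approach the shortest-path costs, mirroring the abstract's remark that the equations reduce to the Bellman equations when there is no exploration.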