Use of variance estimation in the multi-armed bandit problem

msra(2006)

Abstract
An important aspect of most decision making problems concerns the appropriate balance between exploitation (acting optimally according to the partial knowledge acquired so far) and exploration of the environment (acting sub-optimally in order to refine the current knowledge and improve future decisions). A typical example of this so-called exploration versus exploitation dilemma is the multi-armed bandit problem, for which many strategies have been developed. Here we investigate policies based on the choice of the arm having the highest upper-confidence bound, where the bound takes into account the empirical variance of the different arms. Such an algorithm was found earlier to outperform its peers in a series of numerical experiments. The main contribution of this paper is the theoretical investigation of this algorithm. Our contribution here is twofold. First, we prove that with probability at least 1 − β, the regret after n plays of a variant of the UCB algorithm (called β-UCB) is upper-bounded by a constant that scales linearly with log(1/β) but is independent of n. We also analyse a variant which is closer to the algorithm suggested earlier. We prove a logarithmic bound on the expected regret of this algorithm and argue that the bound scales favourably with the variance of the suboptimal arms.
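To make the idea of a variance-aware upper-confidence bound concrete, here is a minimal Python sketch. The abstract does not state the exact form of the bound, so everything below is an assumption made for illustration: the index is the empirical mean plus a term that grows with the empirical variance plus a range-dependent correction (an empirical-Bernstein-style shape), the constants are arbitrary, rewards are assumed to be Bernoulli, and the names `variance_aware_index` and `run_bandit` are hypothetical. It should not be read as the paper's algorithm or as β-UCB.

```python
import math
import random


def variance_aware_index(mean, var, pulls, t, b=1.0):
    """Illustrative upper-confidence index for one arm.

    Assumed form (not taken from the abstract): empirical mean
    + a variance-dependent term + a correction scaled by the reward range b.
    """
    if pulls == 0:
        return math.inf  # force at least one play of every arm
    log_t = math.log(t)
    return mean + math.sqrt(2.0 * var * log_t / pulls) + 3.0 * b * log_t / pulls


def run_bandit(arm_means, n_rounds, seed=0):
    """Play a Bernoulli bandit, always pulling the arm with the highest index."""
    rng = random.Random(seed)
    k = len(arm_means)
    pulls = [0] * k       # number of plays per arm
    sums = [0.0] * k      # sum of rewards per arm
    sq_sums = [0.0] * k   # sum of squared rewards per arm

    for t in range(1, n_rounds + 1):
        indices = []
        for i in range(k):
            if pulls[i] == 0:
                indices.append(math.inf)
                continue
            mean = sums[i] / pulls[i]
            # Empirical variance, clipped at zero against rounding error.
            var = max(sq_sums[i] / pulls[i] - mean * mean, 0.0)
            indices.append(variance_aware_index(mean, var, pulls[i], t))

        arm = max(range(k), key=indices.__getitem__)
        reward = 1.0 if rng.random() < arm_means[arm] else 0.0
        pulls[arm] += 1
        sums[arm] += reward
        sq_sums[arm] += reward * reward

    return pulls


if __name__ == "__main__":
    # Two arms with assumed success probabilities 0.2 and 0.8.
    print(run_bandit([0.2, 0.8], n_rounds=2000))
```

In this sketch an arm whose observed rewards vary little gets a narrower confidence width and is abandoned sooner when its mean is low, which is the intuition behind the abstract's remark that the regret bound scales favourably with the variance of the suboptimal arms.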