Adapting to a Changing Environment: the Brownian Restless Bandits

COLT 2008

Abstract
In the multi-armed bandit (MAB) problem there are k distributions associated with the rewards of playing each of k strategies (slot machine arms). The reward distributions are initially unknown to the player. The player iteratively plays one strategy per round, observes the associated reward, and decides on the strategy for the next iteration. The goal is to maximize the reward by balancing exploitation (the use of acquired information) with exploration (learning new information).

We introduce and study a dynamic MAB problem in which the reward functions stochastically and gradually change in time. Specifically, the expected reward of each arm follows a Brownian motion, a discrete random walk, or a similar process. In this setting a player has to keep exploring continuously in order to adapt to the changing environment. Our formulation is (roughly) a special case of the notoriously intractable restless MAB problem.

Our goal here is to characterize the cost of learning and adapting to the changing environment, in terms of the stochastic rate of the change. We consider an infinite time horizon and strive to minimize the average cost per step, which we define with respect to a hypothetical algorithm that at every step plays the arm with the maximum expected reward at that step. A related line of work on the adversarial MAB problem used a significantly weaker benchmark: the best time-invariant policy.

The dynamic MAB problem models a variety of practical online, game-against-nature optimization settings. While we build on prior work, the algorithms and steady-state analysis for the dynamic setting require a novel approach based on different stochastic tools.
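To make the setting concrete, here is a minimal sketch (not from the paper) of the dynamic MAB environment and benchmark described above: each arm's expected reward follows a reflected Gaussian random walk in [0, 1], and a player is scored against the hypothetical algorithm that plays the arm with the maximum expected reward at every step. The arm count, volatility `sigma`, and the epsilon-greedy player are illustrative assumptions, not the paper's algorithm.

```python
# Sketch of a dynamic MAB with drifting expected rewards and a
# per-step-best-arm benchmark.  All concrete choices are illustrative.
import numpy as np


def average_dynamic_regret(n_arms=5, horizon=10_000, sigma=0.01,
                           epsilon=0.1, seed=0):
    rng = np.random.default_rng(seed)
    mu = rng.uniform(0.0, 1.0, size=n_arms)   # hidden expected rewards
    est = np.zeros(n_arms)                    # player's empirical means
    pulls = np.zeros(n_arms)
    regret = 0.0

    for _ in range(horizon):
        # Explore with probability epsilon, otherwise play the best-looking arm.
        if rng.random() < epsilon:
            arm = int(rng.integers(n_arms))
        else:
            arm = int(np.argmax(est))

        reward = float(rng.random() < mu[arm])        # Bernoulli reward
        pulls[arm] += 1
        est[arm] += (reward - est[arm]) / pulls[arm]  # incremental mean update

        # Benchmark plays the arm with the maximum expected reward *now*.
        regret += mu.max() - mu[arm]

        # Expected rewards drift by a random-walk step, reflected into [0, 1].
        mu = np.abs(mu + rng.normal(0.0, sigma, size=n_arms))
        mu = 1.0 - np.abs(1.0 - mu)

    return regret / horizon


if __name__ == "__main__":
    print("average per-step regret:", average_dynamic_regret())
```

Because the expected rewards keep moving, a strategy that eventually stops exploring (epsilon = 0 after some point) sees its average per-step regret stay bounded away from zero, which is why continuous exploration is necessary in this setting.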
Keywords
random walk, machine learning, steady-state analysis, multi-armed bandit