Efficient Contextual Bandits in Non-stationary Worlds

COLT 2018

Abstract
Most contextual bandit algorithms minimize regret against the best fixed policy, a questionable benchmark for the non-stationary environments that are ubiquitous in applications. In this work, we develop several efficient contextual bandit algorithms for non-stationary environments by equipping existing methods for i.i.d. problems with sophisticated statistical tests so as to dynamically adapt to a change in distribution. For these algorithms we analyze various standard notions of regret suited to non-stationary environments, including interval regret, switching regret, and dynamic regret. When competing with the best policy at each time, one of our algorithms achieves regret $\mathcal{O}(\sqrt{ST})$ if there are $T$ rounds with $S$ stationary periods, or more generally $\mathcal{O}(\Delta^{1/3}T^{2/3})$ where $\Delta$ is some non-stationarity measure. These results almost match the optimal guarantees achieved by an inefficient baseline that is a variant of the classic Exp4 algorithm. The dynamic regret result is also the first for efficient and fully adversarial contextual bandits. Furthermore, while the results above require tuning a parameter based on the unknown quantity $S$ or $\Delta$, we also develop a parameter-free algorithm achieving regret $\min\{S^{1/4}T^{3/4}, \Delta^{1/5}T^{4/5}\}$. This improves and generalizes the best existing result, $\Delta^{0.18}T^{0.82}$ by Karnin and Anava (2016), which holds only for the two-armed bandit problem.
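To make the abstract's recipe concrete (run a stationary base learner and restart it when a statistical test flags a change in distribution), here is a minimal illustrative sketch in Python. Every name in it is an assumption for illustration: the `RestartOnChangeBandit` wrapper, the `make_learner`/`select`/`update` interface, and the crude mean-shift test with `window` and `threshold` are stand-ins, not the paper's actual algorithms or statistical tests.

```python
import numpy as np

class RestartOnChangeBandit:
    """Hypothetical sketch: wrap a stationary contextual bandit learner and
    restart it when a simple statistical test suggests the reward
    distribution has changed. The mean-shift test below is a crude
    stand-in for the paper's more sophisticated tests."""

    def __init__(self, make_learner, window=100, threshold=0.2):
        self.make_learner = make_learner   # factory for a fresh i.i.d. learner
        self.window = window               # size of the recent comparison window
        self.threshold = threshold         # mean shift needed to trigger a restart
        self._reset()

    def _reset(self):
        # Discard stale data and start a fresh stationary learner.
        self.learner = self.make_learner()
        self.rewards = []                  # rewards observed since the last restart

    def select(self, context):
        # Delegate action selection to the current base learner.
        return self.learner.select(context)

    def update(self, context, arm, reward):
        self.learner.update(context, arm, reward)
        self.rewards.append(reward)
        if self._change_detected():
            self._reset()

    def _change_detected(self):
        # Crude test: compare the mean reward over the recent window with
        # the mean over all earlier rounds since the last restart.
        if len(self.rewards) < 2 * self.window:
            return False
        recent = np.mean(self.rewards[-self.window:])
        earlier = np.mean(self.rewards[:-self.window])
        return abs(recent - earlier) > self.threshold
```

In a real instantiation, the base learner would be an efficient i.i.d. contextual bandit method and the test would be calibrated so that false restarts are rare while true distribution changes are caught quickly; the paper's analysis turns on exactly this trade-off.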
Keywords
contextual, worlds, non-stationary