Thompson Sampling with Less Exploration is Fast and Optimal

ICML 2023(2023)

引用 3|浏览31
暂无评分
摘要
We propose $\epsilon$-Exploring Thompson Sampling ($\epsilon$-TS), a modified version of the Thompson Sampling (TS) algorithm for multi-armed bandits. In $\epsilon$-TS, arms are selected greedily based on empirical mean rewards with probability $1-\epsilon$, and based on posterior samples obtained from TS with probability $\epsilon$. Here, $\epsilon\in(0,1)$ is a user-defined constant. By reducing exploration, $\epsilon$-TS improves computational efficiency compared to TS while achieving better regret bounds. We establish that $\epsilon$-TS is both minimax optimal and asymptotically optimal for various popular reward distributions, including Gaussian, Bernoulli, Poisson, and Gamma. A key technical advancement in our analysis is the relaxation of the requirement for a stringent anti-concentration bound of the posterior distribution, which was necessary in recent analyses that achieved similar bounds. As a result, $\epsilon$-TS maintains the posterior update structure of TS while minimizing alterations, such as clipping the sampling distribution or solving the inverse of the Kullback-Leibler (KL) divergence between reward distributions, as done in previous work. Furthermore, our algorithm is as easy to implement as TS, but operates significantly faster due to reduced exploration. Empirical evaluations confirm the efficiency and optimality of $\epsilon$-TS.
更多
查看译文
AI 理解论文
溯源树
样例
生成溯源树,研究论文发展脉络
Chat Paper
正在生成论文摘要