Continuous Time Bandits with Sampling Costs

arXiv (2023)

Abstract
We consider a continuous time multi-armed bandit problem (CTMAB), where the learner can sample arms any number of times in a given interval and obtains a random reward from each sample; however, increasing the sampling frequency incurs an additive penalty/cost. Thus, there is a tradeoff between obtaining a large reward and incurring sampling cost as a function of the sampling frequency. The goal is to design a learning algorithm that minimizes the regret. We establish lower bounds on the regret achievable with any algorithm and propose algorithms that achieve the lower bounds up to logarithmic factors. For the single-arm case, we show a lower bound of $\Omega(1/\mu)$ on the regret and an upper bound of $O((\log(T/\lambda))^{2}/\mu)$, where $\mu$ is the mean of the arm, $T$ is the time horizon, and $\lambda$ is the tradeoff parameter between the reward and the sampling cost. With $K$ arms, we show a lower bound of $\Omega(K\mu_{[1]}/\Delta^{2})$ and an upper bound of $O(K(\log(T/\lambda))^{2}\mu_{[1]}/\Delta^{2})$, where $\mu_{[1]}$ now denotes the mean of the best arm and $\Delta$ is the difference between the means of the best and second-best arms.
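The reward-versus-sampling-cost tradeoff described above can be made concrete with a small simulation. The sketch below is illustrative only: it assumes, purely for the sake of the example, a single Bernoulli($\mu$) arm and a sampling cost that grows quadratically in the sampling frequency $b$ (cost rate $\lambda b^{2}$), so the expected net payoff rate $\mu b - \lambda b^{2}$ is maximized at $b^{*} = \mu/(2\lambda)$. The paper's exact cost model and notation may differ; the function names and parameters here are invented for illustration.

```python
# Toy illustration of the CTMAB tradeoff (not the paper's exact model):
# sampling a single arm at frequency b over horizon T yields b*T random
# rewards, while an ASSUMED quadratic cost rate lambda * b**2 is paid.
import numpy as np

def net_payoff(mu, lam, b, T, rng):
    """Simulated net payoff of sampling at frequency b over horizon T.

    Each of the n = b*T samples returns a Bernoulli(mu) reward; the
    assumed sampling cost is lam * b**2 * T, so the expected net payoff
    is (mu*b - lam*b**2) * T, maximized at b = mu / (2*lam).
    """
    n = int(b * T)                        # total number of samples taken
    rewards = rng.binomial(1, mu, size=n).sum()
    cost = lam * b**2 * T                 # assumed quadratic sampling cost
    return rewards - cost

rng = np.random.default_rng(0)
mu, lam, T = 0.6, 0.05, 10_000
b_star = mu / (2 * lam)                   # optimum under the assumed cost model
for b in [0.5 * b_star, b_star, 2 * b_star]:
    print(f"b = {b:5.2f}  net payoff ~ {net_payoff(mu, lam, b, T, rng):9.1f}")
```

Under these assumed parameters, sampling at half or twice the optimal frequency visibly reduces the net payoff, which is the tension a CTMAB learner must resolve without knowing $\mu$ in advance.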
Keywords
continuous time bandits, sampling costs