Quantifying the Burden of Exploration and the Unfairness of Free Riding.

SODA '20: ACM-SIAM Symposium on Discrete Algorithms, Salt Lake City, Utah, January 2020.

Abstract
We consider the multi-armed bandit setting with a twist. Rather than having just one decision maker deciding which arm to pull in each round, we have n different decision makers (agents). In the simple stochastic setting, we show that a "free-riding" agent observing another "self-reliant" agent can achieve just O(1) regret, as opposed to the regret lower bound of Ω(log t) when one decision maker is playing in isolation. This result holds whenever the self-reliant agent's strategy satisfies either one of two assumptions: (1) each arm is pulled at least γ ln t times in expectation for a constant γ that we compute, or (2) the self-reliant agent achieves o(t) realized regret with high probability. Both of these assumptions are satisfied by standard zero-regret algorithms. Under the second assumption, we further show that the free rider only needs to observe the number of times each arm is pulled by the self-reliant agent, and not the rewards realized. In the linear contextual setting, each arm has a distribution over parameter vectors, each agent has a context vector, and the reward realized when an agent pulls an arm is the inner product of that agent's context vector with a parameter vector sampled from the pulled arm's distribution. We show that the free rider can achieve O(1) regret in this setting whenever the free rider's context is a small (in L2-norm) linear combination of other agents' contexts and all other agents pull each arm Ω(log t) times with high probability. Again, this condition on the self-reliant players is satisfied by standard zero-regret algorithms like UCB. We also prove a number of lower bounds.
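To illustrate the stochastic setting described above, the sketch below is a minimal simulation (not the authors' code): a self-reliant agent runs standard UCB1, while a free rider observes that agent's pulls and realized rewards and simply plays the arm with the highest observed empirical mean. The arm means, horizon, and the `simulate` helper are illustrative assumptions.

```python
# Hypothetical sketch of the free-riding setup: one "self-reliant" UCB1 agent
# explores a stochastic bandit while a "free rider" does no exploration of its
# own and only reuses the self-reliant agent's observations.
import numpy as np

rng = np.random.default_rng(0)

def simulate(means, horizon):
    k = len(means)
    counts = np.zeros(k)          # pulls made by the self-reliant agent
    sums = np.zeros(k)            # rewards realized on those pulls
    regret_self, regret_free = 0.0, 0.0
    best = max(means)

    for t in range(1, horizon + 1):
        # Self-reliant agent: standard UCB1 (pull each arm once, then UCB index).
        if t <= k:
            arm_self = t - 1
        else:
            ucb = sums / counts + np.sqrt(2 * np.log(t) / counts)
            arm_self = int(np.argmax(ucb))
        reward = rng.binomial(1, means[arm_self])
        counts[arm_self] += 1
        sums[arm_self] += reward
        regret_self += best - means[arm_self]

        # Free rider: exploits the empirical means built entirely from the
        # self-reliant agent's observations; its own rewards are never reused.
        emp = np.where(counts > 0, sums / np.maximum(counts, 1), 0.0)
        arm_free = int(np.argmax(emp))
        regret_free += best - means[arm_free]

    return regret_self, regret_free

if __name__ == "__main__":
    r_self, r_free = simulate(means=[0.5, 0.6, 0.7], horizon=100_000)
    print(f"self-reliant (UCB1) regret: {r_self:.1f}")
    print(f"free rider regret:          {r_free:.1f}")
```

In line with the paper's O(1) vs. Ω(log t) separation, the free rider's regret in such a simulation stays bounded while the self-reliant agent's grows logarithmically; the exact constants here are illustrative only.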