Optimal and Adaptive Non-Stationary Dueling Bandits Under a Generalized Borda Criterion
CoRR (2024)
Abstract
In dueling bandits, the learner receives preference feedback between arms,
and the regret of an arm is defined in terms of its suboptimality to a winner
arm. The more challenging and practically motivated non-stationary variant of
dueling bandits, where preferences change over time, has been the focus of
several recent works (Saha and Gupta, 2022; Buening and Saha, 2023; Suk and
Agarwal, 2023). The goal is to design algorithms without foreknowledge of the
amount of change.
The bulk of known results here studies the Condorcet winner setting, where an
arm preferred over any other exists at all times. Yet such a winner may not
exist; by contrast, the Borda version of this problem (which is always
well-defined) has received little attention. In this work, we establish the
first optimal and adaptive Borda dynamic regret upper bound, which highlights
fundamental differences in the learnability of severe non-stationarity between
the Condorcet and Borda regret objectives in dueling bandits.
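To make the contrast between the two winner notions concrete, the following is a minimal illustrative sketch, not taken from the paper. It assumes a fixed preference matrix P, where P[i, j] is the probability that arm i beats arm j, and uses the common textbook normalization of the Borda score (the average win probability against the other K - 1 arms). The function names and the example matrix are hypothetical.

```python
import numpy as np

def borda_scores(P):
    """Average probability that each arm beats each of the other arms.

    Assumes P[i, j] + P[j, i] = 1 and P[i, i] = 0.5 (illustrative convention).
    """
    P = np.asarray(P, dtype=float)
    K = P.shape[0]
    # Drop the diagonal term (an arm compared with itself) before averaging.
    return (P.sum(axis=1) - np.diag(P)) / (K - 1)

def condorcet_winner(P):
    """Return the index of an arm that beats every other arm with
    probability above 1/2, or None if no such arm exists."""
    P = np.asarray(P, dtype=float)
    K = P.shape[0]
    for i in range(K):
        if all(P[i, j] > 0.5 for j in range(K) if j != i):
            return i
    return None

# Hypothetical cyclic preferences: arm 0 beats 1, arm 1 beats 2, arm 2 beats 0.
P_cyclic = np.array([
    [0.5, 0.9, 0.2],
    [0.1, 0.5, 0.9],
    [0.8, 0.1, 0.5],
])

scores = borda_scores(P_cyclic)
borda_winner = int(np.argmax(scores))
cw = condorcet_winner(P_cyclic)
```

In this cyclic example no arm beats both of the others, so no Condorcet winner exists, while the Borda scores still single out an arm with the highest average win probability.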
Surprisingly, our techniques for non-stationary Borda dueling bandits also
yield improved rates within the Condorcet winner setting, and reveal new
preference models where tighter notions of non-stationarity are adaptively
learnable. This is accomplished through a novel generalized Borda score
framework which unites the Borda and Condorcet problems, thus allowing
reduction of Condorcet regret to a Borda-like task. Such a generalization was
not previously known and is likely to be of independent interest.