Latent Contextual Bandits: A Non-Negative Matrix Factorization Approach.

arXiv: Learning (2016)

Abstract
We consider the stochastic contextual bandit problem with a large number of observed contexts and arms, but with a latent low-dimensional structure across contexts. This low-dimensional (latent) structure encodes the fact that both the observed contexts and the mean rewards from the arms are convex mixtures of a small number of underlying latent contexts. At each time, we are presented with an observed context; the bandit problem is to determine the corresponding arm to pull in order to minimize regret. Assuming a separable and low-rank latent-context vs. mean-reward matrix, we employ non-negative matrix factorization (NMF) techniques on sub-sampled estimates of matrix entries (estimates constructed from careful arm sampling) to efficiently discover the underlying factors. This estimation lies at the core of our proposed $\epsilon$-greedy NMF-Bandit algorithm that switches between arm exploration to reconstruct the reward matrix, and exploitation of arms using the reconstructed matrix in order to minimize regret. We identify singular value conditions on the non-negative factors under which the NMF-Bandit algorithm has $\mathcal{O}(L\,\mathrm{poly}(m,\log K)\log T)$ regret, where $L$ is the number of observed contexts, $K$ is the number of arms, and $m$ is the number of latent contexts. We further propose a class of generative models that satisfy our sufficient conditions, and derive a lower bound that matches our achievable bounds up to a $\mathrm{poly}(m,\log K)$ factor. Finally, we validate the NMF-Bandit algorithm on synthetic datasets.
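To make the explore/exploit loop described in the abstract concrete, the following is a minimal illustrative sketch, not the paper's algorithm. It assumes a synthetic $L \times K$ mean-reward matrix formed as a convex mixture of $m$ latent rows, Bernoulli rewards, and an exploration probability of $\min(1, c/t)$. It also swaps in plain multiplicative-update NMF on empirical means in place of the separable-NMF estimator and careful arm sampling the abstract refers to. All function names, parameters, and the schedule constant `c` are hypothetical choices made for illustration.

```python
import numpy as np


def nmf_multiplicative(V, rank, n_iter=200, seed=0, eps=1e-9):
    """Multiplicative-update NMF: find A, W >= 0 with V ~= A @ W.

    This is a generic stand-in, NOT the separable-NMF routine of the paper.
    """
    rng = np.random.default_rng(seed)
    n_rows, n_cols = V.shape
    A = rng.random((n_rows, rank)) + eps
    W = rng.random((rank, n_cols)) + eps
    for _ in range(n_iter):
        # Each update multiplies by a non-negative ratio, so A and W stay >= 0.
        W *= (A.T @ V) / (A.T @ A @ W + eps)
        A *= (V @ W.T) / (A @ W @ W.T + eps)
    return A, W


def run_sketch(L=30, K=20, m=3, T=5000, refit_every=500, c=50.0, seed=0):
    """Epsilon-greedy loop over a synthetic latent-mixture reward matrix.

    Returns (cumulative regret, true mean matrix U, reconstructed matrix U_hat).
    """
    rng = np.random.default_rng(seed)
    # Assumed generative model: each observed context is a convex mixture
    # (row of A_true sums to 1) of m latent contexts with mean rewards W_true.
    A_true = rng.dirichlet(np.ones(m), size=L)   # L x m
    W_true = rng.random((m, K))                  # m x K, entries in [0, 1)
    U = A_true @ W_true                          # L x K true mean rewards

    sums = np.zeros((L, K))
    counts = np.zeros((L, K))
    U_hat = np.full((L, K), 0.5)                 # uninformed initial estimate
    regret = 0.0

    for t in range(1, T + 1):
        s = rng.integers(L)                      # observed context arrives
        if rng.random() < min(1.0, c / t):       # explore: uniform random arm
            a = rng.integers(K)
        else:                                    # exploit: best arm under U_hat
            a = int(np.argmax(U_hat[s]))
        r = float(rng.random() < U[s, a])        # Bernoulli reward
        sums[s, a] += r
        counts[s, a] += 1
        regret += U[s].max() - U[s, a]

        if t % refit_every == 0:
            # Empirical means where observed; keep the previous estimate
            # elsewhere (a simple imputation heuristic, assumed here).
            V = np.where(counts > 0, sums / np.maximum(counts, 1), U_hat)
            A, W = nmf_multiplicative(V, m)
            U_hat = A @ W
    return regret, U, U_hat
```

Calling `run_sketch()` returns the cumulative regret together with the true and reconstructed reward matrices, so the reconstruction error `np.abs(U - U_hat).max()` can be inspected directly. The point of the low-rank step is that a context with few samples can borrow information from other contexts sharing the same latent factors; the regret guarantees stated in the abstract apply to the paper's estimator under its singular value conditions, not to this simplified stand-in.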