A Restless Bandit with No Observable States for Recommendation Systems and Communication Link Scheduling

2015 54th IEEE Conference on Decision and Control (CDC)

Cited by 9
Abstract
A restless bandit is used to model a user's interest in a topic or item. The interest evolves as a Markov chain whose transition probabilities depend on the action (display the ad or desist) taken in each time step. A unit reward is obtained if the ad is displayed and the user clicks on it; if no ad is displayed, a fixed reward is assumed. The click-through probability is determined by the state of the Markov chain. The recommender never observes the state, but in each time step it holds a belief, denoted by π(t), about the state of the Markov chain; π(t) evolves as a function of the action and the signal from each state. For the one-armed restless bandit with two states, we characterize the policy that maximizes the infinite-horizon discounted reward. We first characterize the value function as a function of the system parameters and then characterize the optimal policies for different ranges of the parameters. The Gilbert-Elliott channel, in which the two states have different success probabilities, emerges as a special case. For one special case, we argue that the optimal policy is of the threshold type with a single threshold; extensive numerical results indicate that this may hold in general.
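The belief evolution described above is the standard Bayes filter for a two-state hidden Markov chain with action-dependent transitions: when the ad is displayed, the click/no-click signal updates the belief before it is propagated one step; when the ad is withheld, no signal arrives and the belief is only propagated. The sketch below illustrates this update and a single-threshold policy in Python. All transition matrices, click probabilities, and the threshold value are illustrative assumptions, not the paper's parameters.

```python
# Minimal sketch of the belief update for a two-state restless bandit with
# unobservable states. Parameter values are illustrative assumptions only.

def update_belief(pi, action, clicked, P_display, P_desist, c0, c1):
    """Propagate the belief pi = P(state = 1, "interested").

    action  : 'display' or 'desist'
    clicked : True/False when action == 'display', ignored otherwise
    P_*     : 2x2 row-stochastic transition matrices, one per action
    c0, c1  : click-through probabilities in states 0 and 1
    """
    if action == 'display':
        # Bayes update on the click/no-click signal from the current state.
        if clicked:
            num = pi * c1
            den = pi * c1 + (1.0 - pi) * c0
        else:
            num = pi * (1.0 - c1)
            den = pi * (1.0 - c1) + (1.0 - pi) * (1.0 - c0)
        post = num / den  # posterior P(state = 1 | signal)
        P = P_display
    else:
        post = pi  # no signal is observed when the ad is withheld
        P = P_desist
    # One-step prediction through the action-dependent Markov chain.
    return (1.0 - post) * P[0][1] + post * P[1][1]


def threshold_policy(pi, tau=0.5):
    """Single-threshold policy: display whenever the belief is high enough."""
    return 'display' if pi >= tau else 'desist'


if __name__ == '__main__':
    P_display = [[0.9, 0.1], [0.3, 0.7]]  # interest decays under repeated ads
    P_desist = [[0.7, 0.3], [0.1, 0.9]]   # interest recovers when resting
    c0, c1 = 0.1, 0.8                     # state-dependent click probabilities
    pi = 0.5
    for clicked in (True, False, False):
        action = threshold_policy(pi)
        pi = update_belief(pi, action, clicked, P_display, P_desist, c0, c1)
        print(action, round(pi, 3))
```

With these assumed parameters, a click pushes the belief up and displaying the ad then wears interest down, which is the restless-bandit trade-off the paper analyzes; the Gilbert-Elliott channel corresponds to reading c0 and c1 as the two states' transmission success probabilities.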
Keywords
Markov processes, History, Transmitters, Linear programming, Process control, Analytical models, Context modeling