Pseudo Dyna-Q: A Reinforcement Learning Framework for Interactive Recommendation

WSDM '20: The Thirteenth ACM International Conference on Web Search and Data Mining, Houston, TX, USA, February 2020

Abstract
Applying reinforcement learning (RL) in recommender systems is attractive but costly because of the constraint of interacting with real customers: performing online policy learning through interactions with real customers usually harms the customer experience. A practical alternative is to build a recommender agent offline from logged data, but directly learning from logged data offline suffers from selection bias between the logging policy and the recommendation policy. Existing direct offline learning algorithms, such as Monte Carlo methods and temporal-difference methods, are either computationally expensive or unstable in convergence. To address these issues, we propose Pseudo Dyna-Q (PDQ). In PDQ, instead of interacting with real customers, we resort to a customer simulator, referred to as the World Model, which is designed to simulate the environment and handle the selection bias of the logged data. During policy improvement, the World Model is constantly updated and optimized adaptively according to the current recommendation policy. In this way, PDQ not only avoids the convergence instability and high computational cost of existing approaches but also provides unlimited interactions without involving real customers. Moreover, a proven upper bound on the empirical error of the reward function guarantees that the learned offline policy has lower bias and variance. Extensive experiments on two real-world datasets demonstrate the advantages of PDQ over state-of-the-art methods.
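
As a rough illustration of the Dyna-Q-style loop described in the abstract, the sketch below fits a simple world model on logged data and then improves a Q-learning policy on simulated (pseudo) experience drawn from that model rather than from real customers. All names and sizes here (n_states, n_actions, the tabular reward/transition estimates) are hypothetical toy choices for illustration only; the paper's actual neural world model, its selection-bias correction, and its error bound are not reproduced.

    # Illustrative sketch only: a tabular Dyna-Q-style offline loop in the spirit of PDQ.
    import numpy as np

    rng = np.random.default_rng(0)
    n_states, n_actions = 20, 5        # toy sizes, purely for illustration

    # Logged data from some behaviour (logging) policy: (s, a, r, s') tuples.
    logged = [(rng.integers(n_states), rng.integers(n_actions),
               rng.random(), rng.integers(n_states)) for _ in range(2000)]

    # World model: empirical reward and transition estimates fitted from the log.
    R_sum = np.zeros((n_states, n_actions))
    R_cnt = np.zeros((n_states, n_actions))
    T_cnt = np.zeros((n_states, n_actions, n_states))
    for s, a, r, s2 in logged:
        R_sum[s, a] += r
        R_cnt[s, a] += 1
        T_cnt[s, a, s2] += 1
    R_hat = np.divide(R_sum, np.maximum(R_cnt, 1))

    def simulate(s, a):
        """Sample a pseudo experience from the world model instead of a real customer."""
        total = T_cnt[s, a].sum()
        if total == 0:                 # unseen (s, a): stay in place with zero reward
            return 0.0, s
        probs = T_cnt[s, a] / total
        return R_hat[s, a], rng.choice(n_states, p=probs)

    # Policy improvement: Q-learning on simulated (pseudo) experience.
    Q = np.zeros((n_states, n_actions))
    gamma, alpha, eps = 0.9, 0.1, 0.1
    for step in range(20000):
        s = rng.integers(n_states)
        a = rng.integers(n_actions) if rng.random() < eps else int(Q[s].argmax())
        r, s2 = simulate(s, a)         # pseudo interaction, no real customer involved
        Q[s, a] += alpha * (r + gamma * Q[s2].max() - Q[s, a])
        # In PDQ the world model itself would also be re-fit adaptively here,
        # weighting the logged data toward the current policy to reduce selection bias.
    print(Q.argmax(axis=1))            # greedy recommendation per state

The key design point this sketch tries to convey is the separation of roles: the world model is the only component trained directly on the biased log, while the recommendation policy is trained entirely on the model's pseudo experience, so policy learning never touches real customers.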
Keywords
Pseudo Dyna-Q, Customer Simulator, Model-Based Reinforcement Learning, Offline Policy Learning, Recommender Systems