Overfitting and Optimization in Offline Policy Learning

arXiv (2020)

Abstract
We consider the task of policy learning from an offline dataset generated by some behavior policy. We analyze the two most prominent families of algorithms for this task: policy optimization and Q-learning. We demonstrate that policy optimization suffers from two problems, overfitting and spurious minima, that do not appear in Q-learning or full-feedback problems (i.e. cost-sensitive classification). Specifically, we describe the phenomenon of "bandit overfitting" in which an algorithm overfits based on the actions observed in the dataset, and show that it affects policy optimization but not Q-learning. Moreover, we show that the policy optimization objective suffers from spurious minima even with linear policies, whereas the Q-learning objective is convex for linear models. We empirically verify the existence of both problems in realistic datasets with neural network models.
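
The contrast between the two objectives can be made concrete on a synthetic offline contextual-bandit dataset. The sketch below (not the authors' code; the data-generating process, uniform behavior policy, and all variable names are illustrative assumptions) writes down an importance-weighted policy optimization objective for a softmax-linear policy, which is non-convex in the parameters, next to a least-squares Q-learning objective, which is convex for a linear model.

```python
# Minimal sketch, assuming a synthetic offline bandit dataset with a uniform
# behavior policy; not the paper's implementation.
import numpy as np

rng = np.random.default_rng(0)
n, d, K = 1000, 5, 3          # samples, context dimension, number of actions

# Offline data: contexts, actions logged by the behavior policy, and rewards
# observed only for the logged actions.
X = rng.normal(size=(n, d))
W_true = rng.normal(size=(d, K))
A = rng.integers(K, size=n)                  # behavior policy: uniform over actions
mu = np.full(n, 1.0 / K)                     # behavior propensities
R = (X @ W_true)[np.arange(n), A] + 0.1 * rng.normal(size=n)

def policy_opt_objective(W):
    """Importance-weighted value of a softmax-linear policy (to be maximized).
    Non-convex in W; uses only the rewards of the logged actions."""
    logits = X @ W.reshape(d, K)
    pi = np.exp(logits - logits.max(axis=1, keepdims=True))
    pi /= pi.sum(axis=1, keepdims=True)
    return np.mean(pi[np.arange(n), A] / mu * R)

def q_learning_objective(W):
    """Squared error of a linear Q-model on the logged (x, a, r) triples
    (to be minimized). Convex in W for a linear model."""
    q = (X @ W.reshape(d, K))[np.arange(n), A]
    return np.mean((q - R) ** 2)

W0 = rng.normal(size=d * K)
print("policy-optimization value estimate:", policy_opt_objective(W0))
print("Q-learning squared error:", q_learning_objective(W0))
```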
Keywords
offline contextual bandits, overparameterized models