An Emphatic Approach to the Problem of Off-policy Temporal-Difference Learning

Journal of Machine Learning Research (2016)

Citations: 270 | Views: 146
Abstract
In this paper we introduce the idea of improving the performance of parametric temporal-difference (TD) learning algorithms by selectively emphasizing or de-emphasizing their updates on different time steps. In particular, we show that varying the emphasis of linear TD(λ)'s updates in a particular way causes its expected update to become stable under off-policy training. The only prior model-free TD methods to achieve this with per-step computation linear in the number of function approximation parameters are the gradient-TD family of methods including TDC, GTD(λ), and GQ(λ). Compared to these methods, our emphatic TD(λ) is simpler and easier to use; it has only one learned parameter vector and one step-size parameter. Our treatment includes general state-dependent discounting and bootstrapping functions, and a way of specifying varying degrees of interest in accurately valuing different states.
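The abstract mentions only the high-level properties of emphatic TD(λ): a single weight vector, a single step size, and state-dependent discounting, bootstrapping, and interest. As a rough illustration, the sketch below follows the commonly cited per-step form of the algorithm (a scalar followon trace F, an emphasis M, and an emphasis-weighted eligibility trace); the class name, argument names, and exact signatures here are illustrative rather than taken from the paper, so consult the paper for the authoritative update rules.

```python
import numpy as np

class EmphaticTDLambda:
    """Sketch of linear emphatic TD(lambda) with state-dependent interest,
    discounting, and bootstrapping. Per-step cost is linear in the number
    of features, and only one parameter vector and one step size are used."""

    def __init__(self, n_features, alpha):
        self.theta = np.zeros(n_features)  # single learned parameter vector
        self.e = np.zeros(n_features)      # eligibility trace
        self.F = 0.0                       # followon (scalar) trace
        self.alpha = alpha                 # single step-size parameter
        self.prev_rho = 0.0                # importance-sampling ratio from t-1

    def update(self, phi, reward, phi_next, rho, gamma, gamma_next, lam, interest):
        # Followon trace and emphasis for the current step.
        self.F = self.prev_rho * gamma * self.F + interest
        M = lam * interest + (1.0 - lam) * self.F
        # Emphasis-weighted, importance-corrected eligibility trace.
        self.e = rho * (gamma * lam * self.e + M * phi)
        # Standard TD error under linear function approximation.
        delta = reward + gamma_next * self.theta @ phi_next - self.theta @ phi
        self.theta += self.alpha * delta * self.e
        self.prev_rho = rho
```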
Keywords
Temporal-difference learning, Off-policy learning, Function approximation, Stability, Convergence