Off-policy temporal difference learning with distribution adaptation in fast mixing chains

Soft Computing (2017)

Abstract
In this paper, we investigate the possibility of covariate shift adaptation in off-policy temporal difference learning for the class of fast mixing Markov chains. Off-policy evaluation algorithms in reinforcement learning, such as off-policy least squares temporal difference (LSTD), address the problem of evaluating a target policy that differs from the sampling (or behavior) policy. Off-policy LSTD may yield a poor-quality solution because of the shift between the stationary distributions of the chains induced by following the target and behavior policies. Previous works, namely least squares temporal difference with distribution optimization (LSTD-DO) and the recently proposed emphatic TD, tackle this problem by mapping the distribution of states collected under the behavior policy (i.e., off-policy samples) to a different distribution that admits a better LSTD solution. In this paper, we consider off-policy LSTD for the class of target Markov chains with fast mixing time. For this class of problems, we propose adapting the distribution of off-policy state samples to the distribution of state samples obtained after transition model adaptation, using a regularized covariate shift adaptation algorithm called least squares importance fitting. Empirical evaluations of our proposed approach on two classes of fast mixing chains show promising results compared with LSTD-DO and unadapted off-policy LSTD as the number of samples increases.
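As a rough illustration of the two building blocks the abstract names, the sketch below estimates a state density ratio with a regularized least-squares importance fitting step (uLSIF-style, with Gaussian kernels) and plugs the resulting weights into an importance-weighted LSTD(0) solve. This is a minimal sketch under assumed choices: the kernel centres, bandwidth, feature map, regularization constants, and all function names are illustrative and not taken from the paper, and the synthetic "target" samples stand in for the paper's states obtained after transition model adaptation.

import numpy as np

def gauss_kernel(X, C, sigma):
    # Gaussian kernel matrix between rows of X (n, d) and centres C (m, d).
    d2 = ((X[:, None, :] - C[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-d2 / (2.0 * sigma ** 2))

def lsif_weights(X_behavior, X_target, centres, sigma=1.0, lam=1e-3):
    # Regularized least-squares importance fitting (uLSIF-style):
    # estimate w(s) ~ p_target(s) / p_behavior(s) at the behavior-policy samples.
    Kb = gauss_kernel(X_behavior, centres, sigma)   # (n_b, m)
    Kt = gauss_kernel(X_target, centres, sigma)     # (n_t, m)
    H = Kb.T @ Kb / len(X_behavior)
    h = Kt.mean(axis=0)
    alpha = np.linalg.solve(H + lam * np.eye(len(centres)), h)
    return np.clip(Kb @ alpha, 0.0, None)           # density ratios are non-negative

def weighted_lstd(phi, phi_next, rewards, weights, gamma=0.95, reg=1e-6):
    # Importance-weighted LSTD(0): solve A theta = b with per-sample weights.
    W = weights[:, None]
    A = (W * phi).T @ (phi - gamma * phi_next)
    b = (W * phi).T @ rewards
    return np.linalg.solve(A + reg * np.eye(phi.shape[1]), b)

# Toy usage on synthetic data (shapes only; not the chains studied in the paper).
rng = np.random.default_rng(0)
n, d, k = 500, 2, 8
S = rng.normal(size=(n, d))                     # states sampled under the behavior policy
S_next = S + 0.1 * rng.normal(size=(n, d))      # successor states
R = rng.normal(size=n)                          # observed rewards
S_target = rng.normal(0.3, 1.0, size=(n, d))    # stand-in for the adapted target distribution
feat_centres = S[:k]                            # RBF feature centres (illustrative choice)
phi = gauss_kernel(S, feat_centres, 1.0)
phi_next = gauss_kernel(S_next, feat_centres, 1.0)
w = lsif_weights(S, S_target, centres=S[:100], sigma=1.0, lam=1e-3)
theta = weighted_lstd(phi, phi_next, R, w)      # value-function weights

The reweighting step and the LSTD solve are deliberately kept separate here, mirroring the abstract's description of first adapting the sample distribution and then running LSTD on the reweighted samples.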
Keywords
Reinforcement learning, Off-policy evaluation, Least squares temporal difference, Covariate shift adaptation, Mixing time