Recursive Least-Squares Temporal Difference With Gradient Correction

IEEE Transactions on Cybernetics(2021)

引用 3|浏览14
暂无评分
摘要
Since the late 1980s, temporal difference (TD) learning has dominated the research area of policy evaluation algorithms. However, the demand for the avoidance of TD defects, such as low data-efficiency and divergence in off-policy learning, has inspired the studies of a large number of novel TD-based approaches. Gradient-based and least-squares-based algorithms comprise the major part of these new approaches. This paper aims to combine advantages of these two categories to derive an efficient policy evaluation algorithm with O( n 2 ) per-time-step runtime complexity. The least-squares-based framework is adopted, and the gradient correction is used to improve convergence performance. This paper begins with the revision of a previous O( n 3 ) batch algorithm, least-squares TD with a gradient correction (LS-TDC) to regularize the parameter vector. Based on the recursive least-squares technique, an O( n 2 ) counterpart of LS-TDC called RC is proposed. To increase data efficiency, we generalize RC with eligibility traces. An off-policy extension is also proposed based on importance sampling. In addition, the convergence analysis for RC as well as LS-TDC is given. The empirical results in both on-policy and off-policy benchmarks show that RC has a higher estimation accuracy than that of RLSTD and a significantly lower runtime complexity than that of LSTDC.
更多
查看译文
关键词
Policy evaluation,reinforcement learning (RL),temporal differences (TDs),value function approximation
AI 理解论文
溯源树
样例
生成溯源树,研究论文发展脉络
Chat Paper
正在生成论文摘要