# Fixed-Horizon Temporal Difference Methods for Stable Reinforcement Learning

Kristopher De Asis
Alan Chan
Daniel Graves

AAAI Conference on Artificial Intelligence, 2020.

Keywords:
Markov decision process; horizon method; generalized value functions; horizon return; Deep FHQ-learning

Abstract:

We explore fixed-horizon temporal difference (TD) methods, reinforcement learning algorithms for a new kind of value function that predicts the sum of rewards over a *fixed* number of future time steps. To learn the value function for horizon $h$, these algorithms bootstrap from the value function for horizon $h-1$, or some sho…

Introduction
• Temporal difference (TD) methods (Sutton 1988) are an important approach to reinforcement learning (RL) that combine ideas from Monte Carlo estimation and dynamic programming.
• The learned values represent answers to questions about how a signal will accumulate over time, conditioned on a way of behaving.
• In control tasks, this signal is the reward sequence, and the values represent an arbitrarily long sum of rewards an agent expects to receive when acting greedily with respect to its current predictions.
• Fixed-horizon agents can approximate infinite-horizon returns arbitrarily well, expand their set of learned horizons freely, and combine forecasts from multiple horizons to make time-sensitive predictions about rewards.
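The bootstrapping scheme described above, in which the value function for horizon $h$ is learned from the value function for horizon $h-1$, can be sketched for tabular prediction. This is a minimal illustration under assumptions of my own (the function name `fhtd_update`, an undiscounted target, and updating every horizon on each transition), not the paper's exact algorithm:

```python
import numpy as np

def fhtd_update(V, s, r, s_next, H, alpha=0.1):
    """One fixed-horizon TD(0) update on a transition (s, r, s_next).

    V is an (H+1) x num_states table: V[h][s] estimates the sum of the
    next h rewards from state s. V[0] is identically zero, and each
    horizon h bootstraps from horizon h-1 evaluated at the next state.
    """
    for h in range(1, H + 1):
        target = r + V[h - 1][s_next]          # finite-horizon TD target
        V[h][s] += alpha * (target - V[h][s])  # standard TD error step
    return V
```

Note how no horizon ever bootstraps from its own estimate, which is the structural property the paper leverages for stability under function approximation.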
Highlights
• Temporal difference (TD) methods (Sutton 1988) are an important approach to reinforcement learning (RL) that combine ideas from Monte Carlo estimation and dynamic programming
• The RL problem is usually modeled as a Markov decision process (MDP), in which an agent interacts with an environment over a sequence of discrete time steps
• At each time step $t$, the agent receives information about the environment's current state, $S_t \in \mathcal{S}$, where $\mathcal{S}$ is the set of all possible states in the MDP
• We explore the convergence of fixed-horizon TD (FHTD) formally in Section 4
• We investigated using fixed-horizon returns in place of the conventional infinite-horizon return
• We argued that FHTD agents are stable under function approximation and have additional predictive power.
Results
• For $h = 1, \ldots, H$, the following ODE system has an equilibrium:
• $\dot{w}_{h+1,:} = \mathbb{E}\left[\left(r(x, a, y) + w_{h,:}\phi_y - w_{h+1,:}\phi_x\right)\phi_x^{T}\right]$ (27).
• Finding an equilibrium point amounts to solving the following equations for all $h$: $w_{h+1,:}\mathbb{E}[\phi_x\phi_x^{T}] = \mathbb{E}[(r(x, a, y) + w_{h,:}\phi_y)\phi_x^{T}]$.
• Since the features are assumed linearly independent, and using the fact that $w_{0,:} = 0$, these equations can be solved recursively to find an equilibrium.
• Let $\bar{w}$ be the equilibrium point so generated.
• Define $\tilde{w} := w - \bar{w}$ and substitute into Equation (27) to obtain the following system.
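The recursive solve described above can be illustrated numerically from sampled transitions. This is a sketch under my own assumptions (the function name `fhtd_equilibrium` and empirical expectations over a batch of transitions are not from the paper):

```python
import numpy as np

def fhtd_equilibrium(phi_x, phi_y, r, H):
    """Recursively solve w_{h+1} E[phi_x phi_x^T] = E[(r + w_h . phi_y) phi_x^T].

    phi_x, phi_y: (n, d) feature matrices for n sampled transitions x -> y;
    r: (n,) rewards. Returns W of shape (H+1, d) with W[0] = 0, mirroring
    the base case w_{0,:} = 0 used in the text. Requires the features to
    be linearly independent so that E[phi_x phi_x^T] is invertible.
    """
    n, d = phi_x.shape
    A = phi_x.T @ phi_x / n            # empirical E[phi_x phi_x^T]
    W = np.zeros((H + 1, d))
    for h in range(H):
        target = r + phi_y @ W[h]      # r(x, a, y) + w_{h,:} phi_y
        b = phi_x.T @ target / n       # empirical E[(r + w_h . phi_y) phi_x^T]
        W[h + 1] = np.linalg.solve(A, b)
    return W
```

With one-hot (tabular) features this recovers the exact finite-horizon values, since each solve reduces to one step of finite-horizon dynamic programming.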
Conclusion
• Discussion and future work

In this work, we investigated using fixed-horizon returns in place of the conventional infinite-horizon return.
• We derived FHTD methods and compared them to their infinite-horizon counterparts in terms of prediction capability, complexity, and performance.
• We proved convergence of FHTD methods with linear and general function approximation.
• In a tabular control problem, we showed that greedifying with respect to estimates of a short, fixed horizon can outperform doing so with respect to longer horizons.
• We demonstrated that FHTD methods can scale to, and perform competitively on, a deep reinforcement learning control problem.
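The control setting mentioned above, greedifying with respect to a fixed-horizon estimate, can be sketched as a tabular fixed-horizon Q-learning update. The names `fhq_update` and `greedy_action` are my own, and acting greedily on the final horizon `Q[H]` is one of several reasonable choices, not necessarily the paper's:

```python
import numpy as np

def fhq_update(Q, s, a, r, s_next, H, alpha=0.1):
    """One fixed-horizon Q-learning update on a transition (s, a, r, s_next).

    Q is an (H+1) x |S| x |A| table: Q[h][s][a] estimates the sum of the
    next h rewards after taking a in s. Each horizon bootstraps from the
    greedy value at horizon h-1 in the next state (Q[0] is zero).
    """
    for h in range(1, H + 1):
        target = r + np.max(Q[h - 1][s_next])
        Q[h][s][a] += alpha * (target - Q[h][s][a])
    return Q

def greedy_action(Q, s, H):
    """Act greedily with respect to the horizon-H estimates."""
    return int(np.argmax(Q[H][s]))
```

Greedifying with respect to a shorter horizon is then just `np.argmax(Q[h][s])` for some `h < H`, which is the trade-off the tabular experiment explores.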
Funding
• We gratefully acknowledge funding from Alberta Innovates – Technology Futures, Google DeepMind, and the Natural Sciences and Engineering Research Council of Canada.
Reference
• [Baird 1995] Baird, L. 1995. Residual algorithms: Reinforcement learning with function approximation. In Prieditis, A., and Russell, S., eds., Machine Learning Proceedings 1995. Morgan Kaufmann. 30 – 37.
• [Benveniste, Metivier, and Priouret 1990] Benveniste, A.; Metivier, M.; and Priouret, P. 1990. Adaptive Algorithms and Stochastic Approximations. Springer-Verlag.
• [Bertsekas 2012] Bertsekas, D. 2012. Dynamic Programming & Optimal Control, Vol II: Approximate Dynamic Programming. Athena Scientific, 4 edition.
• [Bhatnagar et al. 2009] Bhatnagar, S.; Precup, D.; Silver, D.; Sutton, R. S.; Maei, H. R.; and Szepesvari, C. 2009. Convergent temporal-difference learning with arbitrary smooth function approximation. In Advances in Neural Information Processing Systems, 1204–1212.
• [Boyan and Moore 1995] Boyan, J. A., and Moore, A. W. 1995. Generalization in reinforcement learning: Safely approximating the value function. In Advances in Neural Information Processing Systems, 369–376.
• [Brockman et al. 2016] Brockman, G.; Cheung, V.; Pettersson, L.; Schneider, J.; Schulman, J.; Tang, J.; and Zaremba, W. 2016. OpenAI Gym. CoRR abs/1606.01540.
• [Choromanska et al. 2015] Choromanska, A.; Henaff, M.; Mathieu, M.; Arous, G. B.; and LeCun, Y. 2015. The loss surfaces of multilayer networks. In Artificial Intelligence and Statistics, 192–204.
• [De Asis, Bennett, and Sutton 2019] De Asis, K.; Bennett, B.; and Sutton, R. S. 2019. Extended abstract: Predicting periodicity with temporal difference learning. 4th Multidisciplinary Conference on Reinforcement Learning and Decision Making 108–111.
• [Fedus et al. 2019] Fedus, W.; Gelada, C.; Bengio, Y.; Bellemare, M. G.; and Larochelle, H. 2019. Hyperbolic discounting and learning over multiple horizons.
• [Ghiassian et al. 2018] Ghiassian, S.; Patterson, A.; White, M.; Sutton, R. S.; and White, A. 2018. Online off-policy prediction. arXiv preprint arXiv:1811.02597.
• [Gordon 1995] Gordon, G. J. 1995. Stable function approximation in dynamic programming. In Machine Learning Proceedings 1995. Elsevier. 261–268.
• [Jaakkola, Jordan, and Singh 1994] Jaakkola, T. S.; Jordan, M. I.; and Singh, S. P. 1994. On the convergence of stochastic iterative dynamic programming algorithms. Neural Computation 6(6):1185–1201.
• [Jaderberg et al. 2016] Jaderberg, M.; Mnih, V.; Czarnecki, W. M.; Schaul, T.; Leibo, J. Z.; Silver, D.; and Kavukcuoglu, K. 2016. Reinforcement learning with unsupervised auxiliary tasks. CoRR abs/1611.05397.
• [Melo and Ribeiro 2007] Melo, F. S., and Ribeiro, M. I. 2007. Q-learning with linear function approximation. In International Conference on Computational Learning Theory, 308– 322. Springer.
• [Meyn and Tweedie 2012] Meyn, S. P., and Tweedie, R. L. 2012. Markov chains and stochastic stability. Springer Science & Business Media.
• [Mnih et al. 2015] Mnih, V.; Kavukcuoglu, K.; Silver, D.; Rusu, A. A.; Veness, J.; Bellemare, M. G.; Graves, A.; Riedmiller, M.; Fidjeland, A. K.; Ostrovski, G.; Petersen, S.; Beattie, C.; Sadik, A.; Antonoglou, I.; King, H.; Kumaran, D.; Wierstra, D.; Legg, S.; and Hassabis, D. 2015. Human-level control through deep reinforcement learning. Nature 518(7540):529–533.
• [Munos and Szepesvari 2008] Munos, R., and Szepesvari, C. 2008. Finite-time bounds for fitted value iteration. Journal of Machine Learning Research 9(May):815–857.
• [Pascanu et al. 2014] Pascanu, R.; Dauphin, Y. N.; Ganguli, S.; and Bengio, Y. 2014. On the saddle point problem for nonconvex optimization. arXiv preprint arXiv:1405.4604.
• [Pennington and Bahri 2017] Pennington, J., and Bahri, Y. 2017. Geometry of neural network loss surfaces via random matrix theory. In Proceedings of the 34th International Conference on Machine Learning-Volume 70, 2798–2806. JMLR. org.
• [Rubinstein 1981] Rubinstein, R. Y. 1981. Simulation and the Monte Carlo Method. New York, NY, USA: John Wiley & Sons, Inc., 1st edition.
• [Sutton and Barto 2018] Sutton, R. S., and Barto, A. G. 2018. Reinforcement Learning: An Introduction. The MIT Press, 2nd edition.
• [Sutton et al. 2009] Sutton, R. S.; Maei, H. R.; Precup, D.; Bhatnagar, S.; Silver, D.; Szepesvari, C.; and Wiewiora, E. 2009. Fast gradient-descent methods for temporal-difference learning with linear function approximation. In Proceedings of the 26th Annual International Conference on Machine Learning, 993–1000. ACM.
• [Sutton et al. 2011] Sutton, R. S.; Modayil, J.; Delp, M.; Degris, T.; Pilarski, P. M.; White, A.; and Precup, D. 2011. Horde: A scalable real-time architecture for learning knowledge from unsupervised sensorimotor interaction. In AAMAS, 761–768. IFAAMAS.
• [Sutton, Precup, and Singh 1999] Sutton, R. S.; Precup, D.; and Singh, S. P. 1999. Between MDPs and semi-MDPs: A framework for temporal abstraction in reinforcement learning. Artificial Intelligence 112(1-2):181–211.
• [Sutton 1988] Sutton, R. S. 1988. Learning to predict by the methods of temporal differences. Machine learning 3(1):9–44.
• [Tieleman and Hinton 2012] Tieleman, T., and Hinton, G. 2012. Lecture 6.5—RmsProp: Divide the gradient by a running average of its recent magnitude. COURSERA: Neural Networks for Machine Learning.
• [Tsitsiklis and Van Roy 1997] Tsitsiklis, J., and Van Roy, B. 1997. An analysis of temporal-difference learning with function approximation. IEEE Transactions on Automatic Control 42(5):674–690.
• [Tsitsiklis and Van Roy 1999] Tsitsiklis, J. N., and Van Roy, B. 1999. Average cost temporal-difference learning. Automatica 35(11):1799–1808.
• [van Hasselt and Sutton 2015] van Hasselt, H., and Sutton, R. S. 2015. Learning to predict independent of span. CoRR abs/1508.04582.
• [van Seijen, Fatemi, and Tavakoli 2019] van Seijen, H.; Fatemi, H.; and Tavakoli, A. 2019. Using a logarithmic mapping to enable lower discount factors in reinforcement learning.
• [Watkins 1989] Watkins, C. J. C. H. 1989. Learning from Delayed Rewards. Ph.D. Dissertation, King’s College, Cambridge, UK.
• [White 2016] White, M. 2016. Unifying task specification in reinforcement learning. CoRR abs/1609.01995.
• [Zhang et al. 2016] Zhang, C.; Bengio, S.; Hardt, M.; Recht, B.; and Vinyals, O. 2016. Understanding deep learning requires rethinking generalization. arXiv preprint arXiv:1611.03530.
• We assume throughout a common probability space (Ω, P, Σ). Our proof follows the general outline in Melo and Ribeiro (2007).
Following Benveniste, Metivier, and Priouret (1990), we must construct another Markov chain so that $H(w_t, x_t)$ in Benveniste, Metivier, and Priouret (1990, p. 213) has access to the TD error at time $t$. In the interest of completeness, we provide the full details below, but the reader may safely skip to the next section.
• We employ a variation of a standard approach, as in, for example, Tsitsiklis and Van Roy (1999). Let us define a new process $M_t = (X_t, A_t, X_{t+1}, A_{t+1})$. The process $M_t$ has state space $\mathcal{M} := \mathcal{X} \times \mathcal{A} \times \mathcal{X} \times \mathcal{A}$ and σ-algebra $\sigma(\mathcal{F} \times 2^{\mathcal{A}} \times \mathcal{F} \times 2^{\mathcal{A}})$, with kernel $\Pi$ defined first on $\mathcal{M} \times \mathcal{F} \times 2^{\mathcal{A}} \times \mathcal{F}$
• Proof. From Meyn and Tweedie (2012, p. 389), a Markov chain is uniformly ergodic iff it is $\nu_m$-small for some $m$. We will show that $M_t$ is $\eta_m$-small for some measure $\eta_m$. Since $(X_t, A_t)$ is uniformly ergodic, let $m > 0$ and $\nu_m$ be a non-trivial measure on $\mathcal{F}$ such that for all $(x, a) \in \mathcal{X} \times \mathcal{A}$, $B \in \sigma(\mathcal{F} \times 2^{\mathcal{A}})$, $P^m((x, a), B) \geq \nu_m(B)$.
• Note that our $M_t$ corresponds to the $X_t$ in Benveniste, Metivier, and Priouret (1990, p. 213). With this construction finished, we will assume in the following that whenever we refer to $\mathcal{X}$ or the Markov chain $X_t$, we are actually referring to $(\mathcal{X} \times \mathcal{A})^2$ or the Markov chain $M_t$, respectively.