# Geometric Insights into the Convergence of Nonlinear TD Learning

ICLR, 2020.

Keywords:

TD, nonlinear convergence, value estimation, reinforcement learning

Abstract:

While there are convergence guarantees for temporal difference (TD) learning when using linear function approximators, the situation for nonlinear models is far less understood, and divergent examples are known. Here we take a first step towards extending theoretical convergence guarantees to TD learning with nonlinear function approximators.

Introduction

- The instability of reinforcement learning (RL) algorithms is well known, but not well characterized theoretically.
- To simplify the analysis, the authors consider only the expected learning dynamics in continuous time, as opposed to the online algorithm with sampling.
- This means that the authors eschew discussions of off-policy data, exploration, sampling variance, and step size.
- The convergence of this ODE is known in two regimes: under linear function approximation for general environments (Tsitsiklis & Van Roy, 1997) and under reversible environments for general function approximation (Ollivier, 2018)
- The authors significantly close this gap through several contributions.
- When the authors use a parametrization inspired by ResNets, nonlinear TD has worst-case error comparable to that of linear TD.
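In the linear case, the expected continuous-time dynamics the authors study can be simulated directly. The sketch below is not the paper's code; the random MDP, features, and Euler step size are illustrative assumptions. It integrates the ODE theta_dot = Phi^T D_mu (r + gamma * P Phi theta - Phi theta) and checks that it reaches the TD fixed point, where the projected Bellman residual vanishes:

```python
import numpy as np

rng = np.random.default_rng(0)
n_states, n_feats, gamma = 6, 3, 0.9

# Illustrative ergodic Markov chain and its stationary distribution mu.
P = rng.random((n_states, n_states)) + 0.1
P /= P.sum(axis=1, keepdims=True)
evals, evecs = np.linalg.eig(P.T)
mu = np.real(evecs[:, np.argmax(np.real(evals))])
mu /= mu.sum()
D = np.diag(mu)

r = rng.random(n_states)                        # rewards
Phi = rng.standard_normal((n_states, n_feats))  # linear features

# Euler integration of the expected TD dynamics:
#   theta_dot = Phi^T D_mu (r + gamma * P Phi theta - Phi theta)
theta = np.zeros(n_feats)
dt = 0.05
for _ in range(200_000):
    v = Phi @ theta
    theta += dt * Phi.T @ D @ (r + gamma * (P @ v) - v)

# At the TD fixed point, the mu-weighted projected Bellman residual is zero.
v = Phi @ theta
residual = Phi.T @ D @ (r + gamma * (P @ v) - v)
print(np.linalg.norm(residual))
```

This convergence is exactly the linear regime covered by Tsitsiklis & Van Roy (1997); the paper's question is what survives when `Phi @ theta` is replaced by a nonlinear parametrization.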

Highlights

- The instability of reinforcement learning (RL) algorithms is well known, but not well characterized theoretically
- Since the dynamics of temporal difference learning do not follow the gradient of any objective function, the interaction between the geometry of the function class and that of the temporal difference update in the space of all functions can eliminate any convergence guarantees.
- We prove global convergence to the true value function when the environment is “more reversible” than the function approximator is “poorly conditioned”.
- We have considered the expected continuous dynamics of the temporal difference algorithm for on-policy value estimation from the perspective of the interaction between the geometry of the function approximator and that of the environment.
- The worst-case solution in this set is comparable to the worst-case linear temporal difference solution for a particular parametrization inspired by ResNets.
- We showed global convergence when the environment is more reversible than the approximator is poorly conditioned.
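The reversibility referenced above (from Ollivier, 2018) is detailed balance: mu_i P_ij = mu_j P_ji under the chain's stationary distribution mu. A minimal numeric check, with made-up example chains (the matrices below are illustrative, not from the paper):

```python
import numpy as np

def is_reversible(P, tol=1e-10):
    """Check detailed balance mu_i P_ij = mu_j P_ji for the chain's stationary mu."""
    evals, evecs = np.linalg.eig(P.T)
    mu = np.real(evecs[:, np.argmax(np.real(evals))])
    mu /= mu.sum()
    F = mu[:, None] * P  # probability flow: F_ij = mu_i P_ij
    return np.allclose(F, F.T, atol=tol)

# Random walk on an undirected weighted graph is reversible:
# mu_i is proportional to the total weight at node i, so mu_i P_ij = w_ij / W.
W = np.array([[0., 2., 1.],
              [2., 0., 3.],
              [1., 3., 0.]])
P_rev = W / W.sum(axis=1, keepdims=True)
print(is_reversible(P_rev))   # True

# A generic directed chain with circulating probability flow is not.
P_dir = np.array([[0.1, 0.8, 0.1],
                  [0.1, 0.1, 0.8],
                  [0.8, 0.1, 0.1]])
print(is_reversible(P_dir))   # False
```

The paper's positive result trades these two quantities off: the less reversible the chain, the better conditioned the approximator must be.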

Conclusion

- The authors have considered the expected continuous dynamics of the TD algorithm for on-policy value estimation from the perspective of the interaction between the geometry of the function approximator and that of the environment.
- Using this perspective, the authors derived two positive results and one negative result.
- The authors provided a generalized counterexample to motivate the assumptions necessary to rule out bad interactions between approximator and environment.

Related work

- Connections to work in the lazy training regime

Concurrent work (Agazzi & Lu, 2019) has proven convergence of expected TD in the nonlinear, non-reversible setting in the so-called “lazy training” regime, in which nonlinear models (including neural networks) with a particular parametrization and scaling behave as linear models, with a kernel given by the linear approximation of the function at initialization. While this kernel captures some structure from the function approximation, the lazy training regime does not account for feature selection, since parameters are confined to a small neighborhood around their initialization (Chizat & Bach, 2018). Another result in a similar direction is from concurrent work (Cai et al., 2019), which considers two-layer networks (one hidden layer) in the large-width regime where only the first layer is trained. They show that this particular type of function with a fixed output layer is nearly linear and derive global convergence in the limit of large width with an additional assumption on the regularity of the stationary distribution. In contrast with these works, our results account for feature selection with more general nonlinear functions. Our homogeneous results hold for a broad class of approximators much closer to those used in practice, and our well-conditioned results hold for general nonlinear parametrizations and provide useful intuition about the relationship between approximator and environment.
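The near-linearity exploited in the lazy regime can be illustrated numerically. For a two-layer tanh network with a fixed ±1 output layer and 1/sqrt(m) output scaling (the setup resembles Cai et al.'s, but the construction below is a hedged illustration, not code from any of these papers), the gap between the network and its linearization at initialization shrinks as the width m grows:

```python
import numpy as np

def linearization_gap(m, d=5, seed=0):
    """Gap between a width-m two-layer tanh network (fixed +/-1 output layer,
    1/sqrt(m) scaling, only the hidden layer treated as trainable) and its
    first-order Taylor expansion at init, for a unit-norm weight perturbation."""
    rng = np.random.default_rng(seed)
    x = rng.standard_normal(d) / np.sqrt(d)   # a fixed input
    W = rng.standard_normal((m, d))           # hidden-layer weights at init
    a = rng.choice([-1.0, 1.0], size=m)       # fixed output layer

    def f(W_):
        return (a @ np.tanh(W_ @ x)) / np.sqrt(m)

    # Gradient of f wrt W at init: row j is (1/sqrt(m)) a_j tanh'(w_j . x) x.
    g = (a * (1.0 - np.tanh(W @ x) ** 2))[:, None] * x[None, :] / np.sqrt(m)

    U = rng.standard_normal((m, d))
    U /= np.linalg.norm(U)                    # unit Frobenius-norm perturbation
    return abs(f(W + U) - (f(W) + np.sum(g * U)))

# The gap shrinks with width: wider networks are "lazier" (more linear).
print(linearization_gap(10), linearization_gap(10000))
```

Feature selection is exactly what this regime gives up: within a unit-norm ball around initialization, the wide network behaves like a fixed linear model on its initial features.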

Funding

- We also thank our lab mates, especially Will Whitney, Aaron Zweig, and Min Jae Song, who provided useful discussions and feedback. This work was partially supported by the Alfred P.

References

- Joshua Achiam, Ethan Knight, and Pieter Abbeel. Towards Characterizing Divergence in Deep Q-Learning. arXiv e-prints, art. arXiv:1903.08894, Mar 2019.
- Andrea Agazzi and Jianfeng Lu. Temporal-difference learning for nonlinear value function approximation in the lazy training regime. CoRR, abs/1905.10917, 2019. URL http://arxiv.org/abs/1905.10917.
- Leemon Baird. Residual algorithms: Reinforcement learning with function approximation. In Machine Learning Proceedings 1995, pp. 30–37.
- Jalaj Bhandari, Daniel Russo, and Raghav Singal. A finite time analysis of temporal difference learning with linear function approximation. arXiv preprint arXiv:1806.02450, 2018.
- Shalabh Bhatnagar, Doina Precup, David Silver, Richard S Sutton, Hamid R. Maei, and Csaba Szepesvari. Convergent temporal-difference learning with arbitrary smooth function approximation. In Y. Bengio, D. Schuurmans, J. D. Lafferty, C. K. I. Williams, and A. Culotta (eds.), Advances in Neural Information Processing Systems 22, pp. 1204–1212. Curran Associates, Inc., 2009.
- Vivek S. Borkar. Stochastic approximation with two time scales. Syst. Control Lett., 29(5):291– 294, February 1997. ISSN 0167-6911. doi: 10.1016/S0167-6911(97)90015-3. URL http://dx.doi.org/10.1016/S0167-6911(97)90015-3.
- V.S. Borkar. Stochastic Approximation: A Dynamical Systems Viewpoint. Cambridge University Press, 2008. ISBN 9780521515924. URL https://books.google.com/books?id= QLxIvgAACAAJ.
- Qi Cai, Zhuoran Yang, Jason D. Lee, and Zhaoran Wang. Neural temporal-difference learning converges to global optima. CoRR, abs/1905.10027, 2019. URL http://arxiv.org/abs/1905.10027.
- Lenaıc Chizat and Francis Bach. A Note on Lazy Training in Supervised Differentiable Programming. working paper or preprint, December 2018. URL https://hal.inria.fr/hal-01945578.
- Wesley Chung, Somjit Nath, Ajin Joseph, and Martha White. Two-timescale networks for nonlinear value function approximation. ICLR, 2019.
- Bo Dai, Albert Shaw, Lihong Li, Lin Xiao, Niao He, Zhen Liu, Jianshu Chen, and Le Song. SBEED: Convergent reinforcement learning with nonlinear function approximation. In Jennifer Dy and Andreas Krause (eds.), Proceedings of the 35th International Conference on Machine Learning, volume 80 of Proceedings of Machine Learning Research, pp. 1125–1134, Stockholmsmssan, Stockholm Sweden, 10–15 Jul 2018. PMLR. URL http://proceedings.mlr.press/v80/dai18c.html.
- Justin Fu, Aviral Kumar, Matthew Soh, and Sergey Levine. Diagnosing Bottlenecks in Deep Qlearning Algorithms. arXiv e-prints, art. arXiv:1902.10250, Feb 2019.
- Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 770–778, 2016.
- Roger A. Horn and Charles R. Johnson. Topics in matrix analysis. Cambridge University Press, Cambridge, 1994. ISBN 0-521-46713-6. Corrected reprint of the 1991 original.
- Arthur Jacot, Franck Gabriel, and Clement Hongler. Neural tangent kernel: Convergence and generalization in neural networks. In Advances in neural information processing systems, pp. 8571– 8580, 2018.
- Tengyuan Liang, Tomaso A. Poggio, Alexander Rakhlin, and James Stokes. Fisher-rao metric, geometry, and complexity of neural networks. CoRR, abs/1711.01530, 2017. URL http://arxiv.org/abs/1711.01530.
- Hamid R. Maei. Gradient temporal-difference learning algorithms. University of Alberta, 2011.
- Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Andrei A. Rusu, Joel Veness, Marc G. Bellemare, Alex Graves, Martin Riedmiller, Andreas K. Fidjeland, Georg Ostrovski, Stig Petersen, Charles Beattie, Amir Sadik, Ioannis Antonoglou, Helen King, Dharshan Kumaran, Daan Wierstra, Shane Legg, and Demis Hassabis. Human-level control through deep reinforcement learning. Nature, 518(7540):529–533, February 2015. ISSN 00280836. URL http://dx.doi.org/10.1038/nature14236.
- Remi Munos. Performance bounds in lp-norm for approximate value iteration. SIAM J. Control and Optimization, 46(2):541–561, 2007.
- Remi Munos and Csaba Szepesvari. Finite-time bounds for fitted value iteration. J. Mach. Learn. Res., 9:815–857, June 2008. ISSN 1532-4435. URL http://dl.acm.org/citation.cfm?id=1390681.1390708.
- Yann Ollivier. Approximate temporal difference learning is a gradient descent for reversible policies. CoRR, abs/1805.00869, 2018. URL http://arxiv.org/abs/1805.00869.
- Samet Oymak and Mahdi Soltanolkotabi. Towards moderate overparameterization: global convergence guarantees for training shallow neural networks. arXiv preprint arXiv:1902.04674, 2019.
- H. Robbins and S. Monro. A stochastic approximation method. Annals of Mathematical Statistics, 22:400–407, 1951.
- Ohad Shamir. Are resnets provably better than linear predictors? In Advances in neural information processing systems, pp. 507–516, 2018.
- Richard S. Sutton. Learning to predict by the methods of temporal differences. Mach. Learn., 3(1):9–44, August 1988. ISSN 0885-6125. doi: 10.1023/A:1022633531479. URL http://dx.doi.org/10.1023/A:1022633531479.
- Richard S. Sutton and Andrew G. Barto. Reinforcement Learning: An Introduction. MIT Press, Cambridge, MA, USA, 2nd edition, 2018.
- John N. Tsitsiklis and Benjamin Van Roy. Analysis of temporal-difference learning with function approximation. In M. C. Mozer, M. I. Jordan, and T. Petsche (eds.), Advances in Neural Information Processing Systems 9, pp. 1075–1081. MIT Press, 1997.
- Zhuoran Yang, Zuyue Fu, Kaiqing Zhang, and Zhaoran Wang. Convergent reinforcement learning with function approximation: A bilevel optimization perspective, 2019a. URL https://openreview.net/forum?id=ryfcCo0ctQ.
- Zhuoran Yang, Yuchen Xie, and Zhaoran Wang. A theoretical analysis of deep q-learning. CoRR, abs/1901.00137, 2019b. URL http://arxiv.org/abs/1901.00137.
- A fragment of the convergence proof: $\;= V(\theta)^\top A\, V(\theta) - c$ for $c = (1-\gamma)(B + \epsilon) > 0$. The last inequality follows from Lemma 1 of Tsitsiklis & Van Roy (1997), which gives $\|P V^\mu\| \le \|V^\mu\|$, together with Cauchy–Schwarz:
$$V(\theta)^\top D_\mu P\, V(\theta) = V(\theta)^\top D_\mu^{1/2} D_\mu^{1/2} P\, V(\theta) \le \|V(\theta)\|_\mu\, \|P V(\theta)\|_\mu \le \|V(\theta)\|_\mu^2.$$
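Lemma 1 of Tsitsiklis & Van Roy (1997), that the transition operator is non-expansive in the mu-weighted norm, is easy to sanity-check numerically on an arbitrary ergodic chain (the chain below is illustrative, not from the paper):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 8

# Illustrative ergodic chain P and its stationary distribution mu.
P = rng.random((n, n)) + 0.05
P /= P.sum(axis=1, keepdims=True)
evals, evecs = np.linalg.eig(P.T)
mu = np.real(evecs[:, np.argmax(np.real(evals))])
mu /= mu.sum()

def mu_norm(v):
    """Weighted norm ||v||_mu = sqrt(sum_i mu_i v_i^2)."""
    return np.sqrt(np.sum(mu * v ** 2))

# Lemma 1: ||P V||_mu <= ||V||_mu for every V (Jensen on each row of P,
# then mu^T P = mu^T sums the per-state bounds).
for _ in range(1000):
    V = rng.standard_normal(n)
    assert mu_norm(P @ V) <= mu_norm(V) + 1e-12
print("Lemma 1 holds on random vectors")
```

This non-expansiveness is the reason the mu-weighted norm (rather than the Euclidean norm) is the right geometry for the contraction argument.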
