# Least Squares Regression with Markovian Data: Fundamental Limits and Algorithms

NeurIPS 2020.


Abstract:

We study the problem of least squares linear regression where the data-points are dependent and are sampled from a Markov chain. We establish sharp information theoretic minimax lower bounds for this problem in terms of $\tau_{\mathsf{mix}}$, the mixing time of the underlying Markov chain, under different noise settings. Our results estab…
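For reference, the mixing time appearing in these bounds is the standard notion (the threshold $1/4$ is the usual convention): for a chain with transition kernel $P$ and stationary distribution $\pi$,

$$\tau_{\mathsf{mix}} \;=\; \min\Big\{\, t \ge 1 \;:\; \max_{x}\, \big\| P^t(x,\cdot) - \pi \big\|_{\mathrm{TV}} \le \tfrac{1}{4} \,\Big\}.$$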


Introduction

- Typical machine learning algorithms and their analyses crucially require the training data to be sampled independently and identically (i.i.d.).
- Popular schemes to break temporal correlations in the input datapoints that have been shown to work well in practice, such as experience replay, are wholly lacking in theoretical analysis.
- The authors' work is comprehensive in its treatment of this important problem; in particular, the authors offer the first theoretical analysis for experience replay in a structured Markovian setting, an idea that is widely adopted in practice for modern deep RL.

Highlights

- Typical machine learning algorithms and their analyses crucially require the training data to be sampled independently and identically (i.i.d.).
- Our work is comprehensive in its treatment of this important problem; in particular, we offer the first theoretical analysis for experience replay in a structured Markovian setting, an idea that is widely adopted in practice for modern deep Reinforcement Learning (RL).
- The error achieved by any Stochastic Gradient Descent (SGD) type procedure can be decomposed as a sum of two terms, bias and variance: the bias part depends on the step size $\alpha$ and on $\|w_1 - w^*\|^2$, where $w_1$ is the starting iterate of the SGD procedure, and the variance part depends on $\sigma^2$.
- We obtain the fundamental limits of performance/minimax rates that are achievable in the linear least squares regression problem with Markov chain data.
- In the general agnostic noise setting, we show that any algorithm suffers by a factor of τmix in both bias and variance, compared to the i.i.d. setting.
- In the independent noise setting, the minimax rate for variance can be improved to match that of the i.i.d. setting, but the standard SGD method with constant step size still suffers from a worse rate.
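The bias–variance split described above can be illustrated with a minimal sketch. This is not the paper's construction: it uses i.i.d. Gaussian features and illustrative constants, and simply runs constant-step-size SGD twice, once from a far initial point with zero noise (isolating the bias term) and once from the optimum with noise (isolating the variance term).

```python
import numpy as np

rng = np.random.default_rng(0)
d, T, alpha, sigma = 5, 2000, 0.05, 0.5      # illustrative dimensions and constants
w_star = rng.normal(size=d)                  # ground-truth regressor

def sgd_error(w1, noise_std):
    """Constant-step-size SGD on least squares; returns final error ||w_T - w*||^2."""
    w = w1.copy()
    for _ in range(T):
        x = rng.normal(size=d)
        y = x @ w_star + noise_std * rng.normal()
        w -= alpha * (x @ w - y) * x         # stochastic gradient of 0.5*(x'w - y)^2
    return np.sum((w - w_star) ** 2)

bias_err = sgd_error(w_star + 10.0, 0.0)     # start far, no noise: bias term only
var_err  = sgd_error(w_star, sigma)          # start at w*, noisy: variance term only
print(bias_err, var_err)
```

With these settings the bias term decays geometrically to (numerically) zero, while the variance term settles at a small but nonzero floor proportional to the step size and noise level.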

Results

- The authors are interested in understanding the limits of SGD type algorithms, with constant step sizes, for solving (2).
- These algorithms are, by far, the most widely used methods in practice for two reasons: 1) these methods are memory efficient, and 2) constant step size allows decreasing the error rapidly in the beginning stages and is crucial for good convergence.
- The variance term arises because the gradients are stochastic and even if the authors initialize the algorithm at w∗, the stochastic gradients are nonzero.
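The mixing time τmix that governs these rates can be computed explicitly for simple chains. A sketch for a two-state chain (illustrative transition probabilities, not the paper's lower-bound construction), using the definition of τmix as the first time every row of P^t is within total-variation distance 1/4 of the stationary distribution:

```python
import numpy as np

# Two-state chain: leave state 0 with prob p, leave state 1 with prob q.
p, q = 0.1, 0.2
P = np.array([[1 - p, p], [q, 1 - q]])
pi = np.array([q, p]) / (p + q)              # stationary distribution

def mixing_time(P, pi, eps=0.25):
    """Smallest t with max_x ||P^t(x, .) - pi||_TV <= eps."""
    Pt = np.eye(len(pi))
    for t in range(1, 10_000):
        Pt = Pt @ P
        if 0.5 * np.abs(Pt - pi).sum(axis=1).max() <= eps:
            return t
    raise RuntimeError("chain did not mix within the step budget")

print(mixing_time(P, pi))  # prints 3 for these transition probabilities
```

The second eigenvalue of this chain is 1 − p − q = 0.7, so the total-variation distance shrinks by that factor per step, which is why only a few steps are needed here; slowly mixing chains (eigenvalue near 1) make τmix, and hence the rates above, correspondingly large.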

Conclusion

- The authors obtain the fundamental limits of performance/minimax rates that are achievable in the linear least squares regression problem with Markov chain data.
- In the general agnostic noise setting, the authors show that any algorithm suffers by a factor of τmix in both bias and variance, compared to the i.i.d. setting.
- In the independent noise setting, the minimax rate for variance can be improved to match that of the i.i.d. setting, but the standard SGD method with constant step size still suffers from a worse rate.
- The authors' results suggest that instead of considering the general class of optimization problems with arbitrary Markov chain data, it may be useful to identify and focus on important special cases of Markovian data, where novel algorithms with nontrivial improvements might be possible.
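The experience-replay idea mentioned above can be sketched as sampling uniformly from a bounded buffer of past observations instead of consuming the Markovian stream in order. This is a generic, hypothetical version (buffer capacity, batch size, and the `ReplayBuffer` class are all illustrative, not the paper's analyzed algorithm):

```python
import random
from collections import deque

class ReplayBuffer:
    """Fixed-capacity buffer; uniform sampling breaks short-range temporal correlation."""
    def __init__(self, capacity):
        self.buf = deque(maxlen=capacity)    # oldest samples are evicted automatically

    def add(self, sample):
        self.buf.append(sample)

    def sample(self, batch_size):
        return random.sample(list(self.buf), min(batch_size, len(self.buf)))

# Usage: feed a Markovian stream, then train on (approximately) decorrelated minibatches.
buffer = ReplayBuffer(capacity=1000)
for t in range(5000):
    buffer.add(t)                            # stand-in for a (x_t, y_t) pair from the chain
batch = buffer.sample(32)
print(len(batch))  # prints 32
```

Uniform sampling from a buffer much larger than the mixing time makes consecutive training examples nearly independent, which is the intuition the paper formalizes in its structured Markovian setting.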

Summary

## Introduction:

Typical machine learning algorithms and their analyses crucially require the training data to be sampled independently and identically (i.i.d.).
- Popular schemes to break temporal correlations in the input datapoints that have been shown to work well in practice, such as experience replay, are wholly lacking in theoretical analysis.
- The authors' work is comprehensive in its treatment of this important problem; in particular, the authors offer the first theoretical analysis for experience replay in a structured Markovian setting, an idea that is widely adopted in practice for modern deep RL.
## Objectives:

Given samples (X1, Y1), ..., (XT, YT), the goal is to estimate the parameter w∗ that minimizes the out-of-sample loss, i.e., the expected loss on a new sample (X, Y), where X is drawn independently from the stationary distribution π of the Markov chain.

## Results:

The authors are interested in understanding the limits of SGD type algorithms, with constant step sizes, for solving (2).
- These algorithms are, by far, the most widely used methods in practice for two reasons: 1) these methods are memory efficient, and 2) constant step size allows decreasing the error rapidly in the beginning stages and is crucial for good convergence.
- The variance term arises because the gradients are stochastic and even if the authors initialize the algorithm at w∗, the stochastic gradients are nonzero.
## Conclusion:

The authors obtain the fundamental limits of performance/minimax rates that are achievable in the linear least squares regression problem with Markov chain data.
- In the general agnostic noise setting, the authors show that any algorithm suffers by a factor of τmix in both bias and variance, compared to the i.i.d. setting.
- In the independent noise setting, the minimax rate for variance can be improved to match that of the i.i.d. setting, but the standard SGD method with constant step size still suffers from a worse rate.
- The authors' results suggest that instead of considering the general class of optimization problems with arbitrary Markov chain data, it may be useful to identify and focus on important special cases of Markovian data, where novel algorithms with nontrivial improvements might be possible.

- Table 1: See Section 2 for a description of the three settings considered in this paper. We suppress universal constants and log factors in the expressions above. For linear regression with i.i.d. data, tail-averaged SGD

