Least Squares Regression with Markovian Data: Fundamental Limits and Algorithms

NeurIPS 2020.

Keywords: constant step size, least squares regression, SGD-Data Drop, noise setting, SGD with Experience Replay

Abstract:

We study the problem of least squares linear regression where the data-points are dependent and are sampled from a Markov chain. We establish sharp information theoretic minimax lower bounds for this problem in terms of $\tau_{\mathsf{mix}}$, the mixing time of the underlying Markov chain, under different noise settings. Our results estab…

Introduction
  • Typical machine learning algorithms and their analyses crucially require the training data to be sampled independently and identically (i.i.d.).
  • Popular schemes to break temporal correlations in the input datapoints that have been shown to work well in practice, such as experience replay, are wholly lacking in theoretical analysis.
  • The authors' work is comprehensive in its treatment of this important problem; in particular, they offer the first theoretical analysis of experience replay in a structured Markovian setting, an idea that is widely adopted in practice for modern deep RL.
Highlights
  • Typical machine learning algorithms and their analyses crucially require the training data to be sampled independently and identically (i.i.d.)
  • Our work is comprehensive in its treatment of this important problem; in particular, we offer the first theoretical analysis of experience replay in a structured Markovian setting, an idea that is widely adopted in practice for modern deep Reinforcement Learning (RL).
  • The error achieved by any Stochastic Gradient Descent (SGD) type procedure can be decomposed as a sum of two terms, bias and variance: the bias part depends on the step size α and on ‖w1 − w∗‖², where w1 is the starting iterate of the SGD procedure, and the variance part depends on σ².
  • We obtain the fundamental limits of performance (minimax rates) achievable for the linear least squares regression problem with Markov chain data.
  • In the general agnostic noise setting, we show that any algorithm suffers by a factor of τmix in both bias and variance, compared to the i.i.d. setting
  • In the independent noise setting, the minimax rate for the variance can be improved to match that of the i.i.d. setting, but the standard SGD method with a constant step size still suffers from a worse rate.
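The bias-variance behavior described in the highlights can be illustrated with a small simulation. This is a minimal sketch, not the paper's construction: the i.i.d. Gaussian covariates, dimension, step size, and noise level below are all illustrative assumptions.

```python
# Sketch: bias and variance of constant-step-size SGD on least squares.
# Starting at w_star, the iterate still fluctuates (variance, driven by
# sigma^2 and alpha); starting far away, the initial error ||w1 - w_star||^2
# (bias) is forgotten geometrically.
import numpy as np

d, T, alpha, sigma = 5, 20000, 0.01, 0.5
w_star = np.random.default_rng(0).normal(size=d)

def sgd(w1, seed):
    """Run w_{t+1} = w_t - alpha * (x_t^T w_t - y_t) x_t on fresh i.i.d.
    samples; return the final squared parameter error ||w_T - w_star||^2."""
    rng = np.random.default_rng(seed)
    w = w1.copy()
    for _ in range(T):
        x = rng.normal(size=d)
        y = x @ w_star + sigma * rng.normal()
        w -= alpha * (x @ w - y) * x
    return float(np.sum((w - w_star) ** 2))

err_at_opt = sgd(w_star, seed=1)        # pure variance: nonzero even at w_star
err_far = sgd(w_star + 10.0, seed=1)    # bias decays; same noise floor remains
print(err_at_opt, err_far)
```

Both runs end near the same noise floor of order α·σ²·d, even though the second starts with squared error 500: the bias term is forgotten geometrically while the variance term persists.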
Results
  • The authors are interested in understanding the limits of SGD type algorithms, with constant step sizes, for solving (2).
  • These algorithms are, by far, the most widely used methods in practice for two reasons: 1) they are memory efficient, and 2) a constant step size allows the error to decrease rapidly in the early stages and is crucial for good convergence.
  • The variance term arises because the gradients are stochastic and even if the authors initialize the algorithm at w∗, the stochastic gradients are nonzero.
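To make the Markovian setting concrete, the sketch below runs constant-step-size SGD on covariates generated by a Gaussian AR(1) Markov chain with independent label noise, once on every sample and once on every 20th sample, a simple instantiation of the data-drop idea named among the paper's keywords. The chain, its parameters, and the stride are illustrative assumptions, not the paper's exact algorithm.

```python
# Sketch: SGD on Markovian covariates, with and without data drop.
# The AR(1) chain x_{t+1} = rho*x_t + sqrt(1-rho^2)*xi_t has stationary
# distribution N(0, I); skipping ~tau_mix samples makes the retained
# covariates nearly independent.
import numpy as np

rng = np.random.default_rng(0)
d, T, alpha, sigma, rho = 5, 50000, 0.02, 0.1, 0.9
w_star = rng.normal(size=d)

# Generate the chain, started from stationarity.
x = rng.normal(size=d)
xs = np.empty((T, d))
for t in range(T):
    x = rho * x + np.sqrt(1 - rho**2) * rng.normal(size=d)
    xs[t] = x
# Independent noise setting: label noise is i.i.d., independent of the chain.
ys = xs @ w_star + sigma * rng.normal(size=T)

def sgd(stride):
    """Constant-step-size SGD using every `stride`-th sample of the chain."""
    w = np.zeros(d)
    for t in range(0, T, stride):
        w -= alpha * (xs[t] @ w - ys[t]) * xs[t]
    return float(np.sum((w - w_star) ** 2))

err_all = sgd(stride=1)    # vanilla SGD on correlated data
err_drop = sgd(stride=20)  # data drop: keep ~1/20 of the samples
print(err_all, err_drop)
```

Here the stride (20) is chosen to exceed the chain's rough mixing scale 1/(1 − ρ) = 10, so consecutive retained covariates are almost uncorrelated.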
Conclusion
  • The authors obtain the fundamental limits of performance (minimax rates) achievable for the linear least squares regression problem with Markov chain data.
  • In the general agnostic noise setting, the authors show that any algorithm suffers by a factor of τmix in both bias and variance, compared to the i.i.d. setting.
  • In the independent noise setting, the minimax rate for the variance can be improved to match that of the i.i.d. setting, but the standard SGD method with a constant step size still suffers from a worse rate.
  • The authors' results suggest that, instead of considering the general class of optimization problems with arbitrary Markov chain data, it may be useful to identify and focus on important special cases of Markovian data, where novel algorithms with nontrivial improvements might be possible.
Summary
  • Objectives:

    Given samples (X1, Y1), …, (XT, YT), the goal is to estimate the parameter w∗ that minimizes the out-of-sample loss, i.e., the expected loss on a new sample (X, Y), where X is drawn independently from the stationary distribution π of the Markov chain.
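The out-of-sample objective can be made concrete in a short sketch. Assuming, for illustration only, that the stationary distribution π is N(0, I) (so A = E[XXᵀ] = I), the loss of an estimate w decomposes as L(w) = σ² + (w − w∗)ᵀA(w − w∗); the estimate w_hat below is hypothetical.

```python
# Sketch: Monte-Carlo check of the out-of-sample loss decomposition
# L(w) = E[(X^T w - Y)^2] = sigma^2 + (w - w_star)^T A (w - w_star),
# with pi = N(0, I) so that A = I (an illustrative assumption).
import numpy as np

rng = np.random.default_rng(0)
d, n, sigma = 4, 200000, 0.3
w_star = rng.normal(size=d)
w_hat = w_star + 0.1  # a hypothetical estimate, off by 0.1 per coordinate

X = rng.normal(size=(n, d))               # fresh draws from pi
y = X @ w_star + sigma * rng.normal(size=n)
L_hat = float(np.mean((X @ w_hat - y) ** 2))
excess = float((w_hat - w_star) @ np.eye(d) @ (w_hat - w_star))

print(L_hat, sigma**2 + excess)  # the two quantities should nearly agree
```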
Tables
  • Table 1: See Section 2 for a description of the three settings considered in this paper. We suppress universal constants and log factors in the expressions above. For linear regression with i.i.d. data, tail-averaged SGD…