Stochastic Gradient Descent For Modern Machine Learning: Theory, Algorithms And Applications

(2019)

Abstract
Tremendous advances in large-scale machine learning and deep learning have been powered by the seemingly simple and lightweight stochastic gradient method. Variants of the stochastic gradient method (based on iterate averaging) are known to be asymptotically optimal in terms of predictive performance. This thesis examines non-asymptotic issues surrounding the use of stochastic gradient descent (SGD) in practice, with the aim of achieving its asymptotically optimal statistical properties. Focusing on the stochastic approximation problem of least squares regression, this thesis considers: 1. Understanding the benefits of tail-averaged SGD, and how SGD's non-asymptotic behavior is affected by mis-specified problem instances. 2. Understanding the parallelization properties of SGD, with a specific focus on mini-batching, model averaging, and batch-size doubling. Can this characterization shed light on algorithmic regimes (e.g., the largest instance-dependent batch sizes) that admit linear parallelization speedups over vanilla SGD (with batch size 1), thus presenting useful prescriptions that make the best use of hardware resources while not being wasteful of computation? As a byproduct of these results, can we understand how the learning rate behaves as a function of the batch size? 3. Just as momentum/acceleration schemes such as heavy-ball momentum or Nesterov's acceleration improve over standard batch gradient descent, can we formalize the improvements achieved by accelerated methods when working with sampled stochastic gradients? Is there an algorithm that achieves this improvement over …
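To make the setting concrete, below is a minimal sketch of tail-averaged mini-batch SGD for least squares regression, the core setting of the abstract. This is an illustrative assumption of how such a procedure can look, not the thesis's specific algorithm; the function name, parameter names, and the choice of tail fraction are all hypothetical.

```python
import numpy as np

def tail_averaged_sgd(X, y, lr=0.05, batch_size=1, n_passes=20,
                      tail_frac=0.5, seed=0):
    """Mini-batch SGD on the least squares objective (1/2n)||Xw - y||^2,
    returning the average of the final `tail_frac` fraction of iterates
    (tail averaging) instead of the last iterate."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    w = np.zeros(d)
    iterates = []
    n_steps = (n * n_passes) // batch_size
    for _ in range(n_steps):
        # Sample a mini-batch of rows with replacement (stochastic gradients).
        idx = rng.integers(0, n, size=batch_size)
        Xb, yb = X[idx], y[idx]
        grad = Xb.T @ (Xb @ w - yb) / batch_size
        w = w - lr * grad
        iterates.append(w.copy())
    # Average only the tail of the iterate sequence.
    tail_start = int(len(iterates) * (1 - tail_frac))
    return np.mean(iterates[tail_start:], axis=0)
```

Tail averaging (as opposed to averaging all iterates) discards the early transient phase, which is one reason it can attain better non-asymptotic behavior; increasing `batch_size` while rescaling `lr` is the knob studied in the parallelization questions above.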