The Surprising Simplicity of the Early-Time Learning Dynamics of Neural Networks

NeurIPS 2020.

Abstract:

Modern neural networks are often regarded as complex black-box functions whose behavior is difficult to understand owing to their nonlinear dependence on the data and the nonconvexity in their loss landscapes. In this work, we show that these common perceptions can be completely false in the early phase of learning. In particular, we formally prove that, for a class of well-behaved input distributions, the early-time learning dynamics of gradient descent on a two-layer fully-connected neural network with any common activation can be mimicked by training a simple linear model on the inputs.

Introduction
  • Modern deep learning models are enormously complex function approximators, with many state-of-the-art architectures employing millions or even billions of trainable parameters [Radford et al., 2019, Adiwardana et al., 2020].
  • From the empirical perspective, practical models are flexible enough to perfectly fit the training data, even if the labels are pure noise [Zhang et al., 2017].
  • Yet these same high-capacity models generalize well when trained on real data, even without any explicit control of capacity.
  • To reconcile theory with observation, it has been suggested that gradient-based training induces an implicit regularization that biases trained models towards simpler functions.
  • The exact notion of simplicity and the mechanism by which it might be achieved remain poorly understood except in certain simplistic settings.
Highlights
  • Modern deep learning models are enormously complex function approximators, with many state-of-the-art architectures employing millions or even billions of trainable parameters [Radford et al., 2019, Adiwardana et al., 2020].
  • To reconcile theory with observation, it has been suggested that deep neural networks may enjoy some form of implicit regularization induced by gradient-based training algorithms that biases the trained models towards simpler functions.
  • We formally prove that, for a class of well-behaved input distributions, the early-time learning dynamics of gradient descent on a two-layer fully-connected neural network with any common activation can be mimicked by training a simple linear model on the inputs (a minimal sketch of this comparison appears after this list).
  • Our result formally proves that a neural network and a corresponding linear model make similar predictions early in training, providing a theoretical explanation of the empirical finding of Nakkiran et al. [2019].
  • While we mainly focused on two-layer fully-connected neural networks, we further provided theoretical and empirical evidence suggesting that this phenomenon continues to exist in more complicated models.
  • Extending our result to those settings is a direction of future work. Another interesting direction is to study the dynamics of neural networks after the initial linear learning phase.
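As an illustration of the kind of comparison this result describes, the following is a minimal sketch (not the authors' code): it runs full-batch gradient descent on a two-layer tanh network and, in parallel, on a linear model of the inputs, and prints their test losses and the gap between their predictions over the first training steps. The synthetic Gaussian data, the particular network parameterization, the symmetric initialization, and the sqrt(2)·ζ scaling of the linear features (a back-of-the-envelope NTK-style calculation, with ζ = E[tanh′(g)] for standard Gaussian g) are all assumptions made for this illustration rather than the exact setup of the theorem.

```python
# Minimal sketch (not the authors' code): full-batch gradient descent on a
# two-layer tanh network versus gradient descent on a simple linear model of
# the inputs.  The parameterization, the synthetic data, and the sqrt(2)*zeta
# feature scaling below are illustrative assumptions, not the paper's constants.
import numpy as np

rng = np.random.default_rng(0)
n, n_test, d, m, lr, steps = 200, 200, 50, 2048, 1.0, 150

# Synthetic data: Gaussian inputs, +/-1 labels from a random linear teacher.
w_star = rng.standard_normal(d)
X, X_te = rng.standard_normal((n, d)), rng.standard_normal((n_test, d))
y, y_te = np.sign(X @ w_star), np.sign(X_te @ w_star)

# Two-layer network f(x) = a^T tanh(W x / sqrt(d)) / sqrt(m), both layers trained.
# Symmetric initialization (paired neurons with opposite output signs) so f_0 = 0.
W_half = rng.standard_normal((m // 2, d))
W = np.vstack([W_half, W_half])
a = np.concatenate([np.ones(m // 2), -np.ones(m // 2)])

def net(Xb, W, a):
    return np.tanh(Xb @ W.T / np.sqrt(d)) @ a / np.sqrt(m)

# Linear model f_lin(x) = beta^T (c * x / sqrt(d)) with c = sqrt(2) * zeta,
# where zeta = E[tanh'(g)], g ~ N(0, 1), is estimated by Monte Carlo.
zeta = np.mean(1.0 - np.tanh(rng.standard_normal(200_000)) ** 2)
c = np.sqrt(2.0) * zeta
beta = np.zeros(d)

for t in range(steps + 1):
    if t % 30 == 0:
        f_nn, f_lin = net(X_te, W, a), c * X_te @ beta / np.sqrt(d)
        print(f"step {t:4d}  test_mse_nn={np.mean((f_nn - y_te) ** 2):.3f}  "
              f"test_mse_lin={np.mean((f_lin - y_te) ** 2):.3f}  "
              f"gap={np.linalg.norm(f_nn - f_lin) / np.linalg.norm(y_te):.3f}")
    # One gradient step for the network on the squared loss (1/(2n) scaling).
    H = np.tanh(X @ W.T / np.sqrt(d))            # (n, m) hidden activations
    r = H @ a / np.sqrt(m) - y                   # residual f(X) - y
    grad_a = H.T @ r / (n * np.sqrt(m))
    grad_W = ((r[:, None] * (1.0 - H ** 2)) * a).T @ X / (n * np.sqrt(m * d))
    a, W = a - lr * grad_a, W - lr * grad_W
    # One gradient step for the linear model with the same learning rate.
    r_lin = c * X @ beta / np.sqrt(d) - y
    beta = beta - lr * c * X.T @ r_lin / (n * np.sqrt(d))
```

With these choices both predictors start from the zero function and are trained with the same learning rate, so the printed gap directly tracks how long the network stays close to the linear model; reproducing the theorem faithfully would require the scaling constants and step-count bounds given in the paper.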
Results
  • The authors perform experiments on a binary classification task from CIFAR-10 (“cats” vs “horses”) using a multi-layer FC network and a CNN.
  • To examine the evolution of the losses at a finer granularity, the authors decompose the residual of the predictions on the test data (namely, f_t(x) − y for all test points, collected as a vector in R^2000) into its components in V_lin, the space spanned by the inputs, and in its orthogonal complement V_lin^⊥ (see the projection sketch after this list).
  • For both networks, the authors observe in Figure 3a that the test losses of the networks and the linear model are almost identical up to 1,000 steps, after which the networks begin to make progress in V_lin^⊥.
  • The detailed setup and additional results for full-size CIFAR-10 and MNIST are deferred to Appendix A
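The decomposition above reduces to an orthogonal projection of the residual vector onto the column space of the test inputs. Below is a minimal sketch of that projection (assumed notation, with random stand-ins for the test inputs and the residual rather than the actual CIFAR-10 test set).

```python
# Minimal sketch: split a residual vector into its component in V_lin (the span
# of the test inputs, i.e. what a linear function of the inputs can still
# explain) and its component in the orthogonal complement V_lin_perp.
import numpy as np

def decompose_residual(X_test, residual):
    """Return (r_par, r_perp): the parts of `residual` (shape (N,)) inside and
    outside the column space of X_test (shape (N, d))."""
    # Least squares gives the orthogonal projection onto span{columns of X_test}.
    coef, *_ = np.linalg.lstsq(X_test, residual, rcond=None)
    r_par = X_test @ coef        # component in V_lin
    r_perp = residual - r_par    # component in V_lin_perp
    return r_par, r_perp

# Usage with synthetic stand-ins for the 2,000 test points and f_t(x) - y.
rng = np.random.default_rng(0)
X_test = rng.standard_normal((2000, 100))   # hypothetical test inputs
residual = rng.standard_normal(2000)        # hypothetical residual vector
r_par, r_perp = decompose_residual(X_test, residual)
print(np.linalg.norm(r_par) ** 2, np.linalg.norm(r_perp) ** 2)
```

Whether V_lin also contains a constant (bias) direction depends on the exact definition in the paper; if it does, appending a column of ones to X_test before the least-squares step adds it.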
Conclusion
  • This work gave a novel theoretical result rigorously showing that gradient descent on a neural network learns a simple linear function in the early phase of training.
  • Extending the result to more complicated models, such as the deeper and convolutional networks studied empirically here, is a direction of future work.
  • Another interesting direction is to study the dynamics of neural networks after the initial linear learning phase.
References
  • Alessandro Achille, Matteo Rovere, and Stefano Soatto. Critical learning periods in deep neural networks. arXiv preprint arXiv:1711.08856, 2017.
  • Daniel Adiwardana, Minh-Thang Luong, David R So, Jamie Hall, Noah Fiedel, Romal Thoppilan, Zi Yang, Apoorv Kulshreshtha, Gaurav Nemade, Yifeng Lu, et al. Towards a human-like open-domain chatbot. arXiv preprint arXiv:2001.09977, 2020.
  • Sanjeev Arora, Nadav Cohen, Wei Hu, and Yuping Luo. Implicit regularization in deep matrix factorization. In Advances in Neural Information Processing Systems, pages 7411–7422, 2019a.
  • Sanjeev Arora, Simon S Du, Wei Hu, Zhiyuan Li, Ruslan Salakhutdinov, and Ruosong Wang. On exact computation with an infinitely wide neural net. arXiv preprint arXiv:1904.11955, 2019b.
  • Sanjeev Arora, Simon S Du, Wei Hu, Zhiyuan Li, and Ruosong Wang. Fine-grained analysis of optimization and generalization for overparameterized two-layer neural networks. arXiv preprint arXiv:1901.08584, 2019c.
  • Yu Bai and Jason D. Lee. Beyond linearization: On quadratic and higher-order approximation of wide neural networks. In International Conference on Learning Representations, 2020. URL https://openreview.net/forum?id=rkllGyBFPH.
  • Peter L Bartlett and Shahar Mendelson. Rademacher and Gaussian complexities: Risk bounds and structural results. Journal of Machine Learning Research, 3(Nov):463–482, 2002.
  • Peter L Bartlett, Dylan J Foster, and Matus J Telgarsky. Spectrally-normalized margin bounds for neural networks. In Advances in Neural Information Processing Systems, pages 6241–6250, 2017.
  • James Bradbury, Roy Frostig, Peter Hawkins, Matthew James Johnson, Chris Leary, Dougal Maclaurin, and Skye Wanderman-Milne. JAX: composable transformations of Python+NumPy programs, 2018. URL http://github.com/google/jax.
  • Yuan Cao, Zhiying Fang, Yue Wu, Ding-Xuan Zhou, and Quanquan Gu. Towards understanding the spectral bias of deep learning. arXiv preprint arXiv:1912.01198, 2019.
  • Lenaic Chizat and Francis Bach. Implicit bias of gradient descent for wide two-layer neural networks trained with the logistic loss. arXiv preprint arXiv:2002.04486, 2020.
  • Lenaic Chizat, Edouard Oyallon, and Francis Bach. On lazy training in differentiable programming. In Advances in Neural Information Processing Systems, pages 2933–2943, 2019.
  • Simon S Du, Wei Hu, Sham M Kakade, Jason D Lee, and Qi Lei. Few-shot learning via learning the representation, provably. arXiv preprint arXiv:2002.09434, 2020.
  • Gintare Karolina Dziugaite and Daniel M Roy. Computing nonvacuous generalization bounds for deep (stochastic) neural networks with many more parameters than training data. arXiv preprint arXiv:1703.11008, 2017.
  • Noureddine El Karoui. The spectrum of kernel random matrices. The Annals of Statistics, 38(1):1–50, 2010.
  • Jonathan Frankle and Michael Carbin. The lottery ticket hypothesis: Finding sparse, trainable neural networks. In International Conference on Learning Representations, 2019. URL https://openreview.net/forum?id=rJl-b3RcF7.
  • Gauthier Gidel, Francis Bach, and Simon Lacoste-Julien. Implicit regularization of discrete gradient dynamics in linear neural networks. In Advances in Neural Information Processing Systems, pages 3196–3206, 2019.
  • Suriya Gunasekar, Blake E Woodworth, Srinadh Bhojanapalli, Behnam Neyshabur, and Nati Srebro. Implicit regularization in matrix factorization. In Advances in Neural Information Processing Systems, pages 6151–6159, 2017.
  • Suriya Gunasekar, Jason Lee, Daniel Soudry, and Nathan Srebro. Implicit bias of gradient descent on linear convolutional networks. arXiv preprint arXiv:1806.00468, 2018.
  • Guy Gur-Ari, Daniel A Roberts, and Ethan Dyer. Gradient descent happens in a tiny subspace. arXiv preprint arXiv:1812.04754, 2018.
  • Wei Hu, Zhiyuan Li, and Dingli Yu. Simple and effective regularization methods for training on noisily labeled data with generalization guarantee. In International Conference on Learning Representations, 2020. URL https://openreview.net/forum?id=Hke3gyHYwH.
  • Arthur Jacot, Franck Gabriel, and Clément Hongler. Neural tangent kernel: Convergence and generalization in neural networks. arXiv preprint arXiv:1806.07572, 2018.
  • Ziwei Ji and Matus Telgarsky. The implicit bias of gradient descent on nonseparable data. In Conference on Learning Theory, pages 1772–1798, 2019a.
  • Ziwei Ji and Matus Jan Telgarsky. Gradient descent aligns the layers of deep linear networks. In 7th International Conference on Learning Representations, ICLR 2019, 2019b.
  • Yegor Klochkov and Nikita Zhivotovskiy. Uniform Hanson-Wright type concentration inequalities for unbounded entries via the entropy method. Electronic Journal of Probability, 25, 2020.
  • Andrew K Lampinen and Surya Ganguli. An analytic theory of generalization dynamics and transfer learning in deep linear networks. arXiv preprint arXiv:1809.10374, 2018.
  • Jaehoon Lee, Lechao Xiao, Samuel S Schoenholz, Yasaman Bahri, Jascha Sohl-Dickstein, and Jeffrey Pennington. Wide neural networks of any depth evolve as linear models under gradient descent. arXiv preprint arXiv:1902.06720, 2019.
  • Aitor Lewkowycz, Yasaman Bahri, Ethan Dyer, Jascha Sohl-Dickstein, and Guy Gur-Ari. The large learning rate phase of deep learning: the catapult mechanism. arXiv preprint arXiv:2003.02218, 2020.
  • Yuanzhi Li, Tengyu Ma, and Hongyang Zhang. Algorithmic regularization in over-parameterized matrix sensing and neural networks with quadratic activations. In Conference on Learning Theory, pages 2–47, 2018.
  • Yuanzhi Li, Colin Wei, and Tengyu Ma. Towards explaining the regularization effect of initial large learning rate in training neural networks. In Advances in Neural Information Processing Systems, pages 11669–11680, 2019a.
  • Zhiyuan Li, Ruosong Wang, Dingli Yu, Simon S Du, Wei Hu, Ruslan Salakhutdinov, and Sanjeev Arora. Enhanced convolutional neural tangent kernels. arXiv preprint arXiv:1911.00809, 2019b.
  • Kaifeng Lyu and Jian Li. Gradient descent maximizes the margin of homogeneous neural networks. arXiv preprint arXiv:1906.05890, 2019.
  • David A McAllester. PAC-Bayesian model averaging. In Proceedings of the twelfth annual conference on Computational learning theory, pages 164–170, 1999.
  • Mehryar Mohri, Afshin Rostamizadeh, and Ameet Talwalkar. Foundations of machine learning. MIT Press, 2012.
  • Preetum Nakkiran, Gal Kaplun, Dimitris Kalimeris, Tristan Yang, Benjamin L Edelman, Fred Zhang, and Boaz Barak. SGD on neural networks learns functions of increasing complexity. arXiv preprint arXiv:1905.11604, 2019.
  • Behnam Neyshabur, Srinadh Bhojanapalli, David McAllester, and Nathan Srebro. A PAC-Bayesian approach to spectrally-normalized margin bounds for neural networks. arXiv preprint arXiv:1707.09564, 2017a.
  • Behnam Neyshabur, Srinadh Bhojanapalli, David McAllester, and Nati Srebro. Exploring generalization in deep learning. In Advances in Neural Information Processing Systems, pages 5947–5956, 2017b.
  • Roman Novak, Lechao Xiao, Jiri Hron, Jaehoon Lee, Alexander A Alemi, Jascha Sohl-Dickstein, and Samuel S Schoenholz. Neural tangents: Fast and easy infinite neural networks in python. arXiv preprint arXiv:1912.02803, 2019.
  • Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, and Ilya Sutskever. Language models are unsupervised multitask learners. OpenAI Blog, 1(8):9, 2019.
  • Nasim Rahaman, Aristide Baratin, Devansh Arpit, Felix Draxler, Min Lin, Fred A Hamprecht, Yoshua Bengio, and Aaron Courville. On the spectral bias of neural networks. arXiv preprint arXiv:1806.08734, 2018.
  • Noam Razin and Nadav Cohen. Implicit regularization in deep learning may not be explainable by norms. arXiv preprint arXiv:2005.06398, 2020.
  • Mark Rudelson and Roman Vershynin. Hanson-Wright inequality and sub-Gaussian concentration. Electronic Communications in Probability, 18, 2013.
  • AM Saxe, JL McClelland, and S Ganguli. Exact solutions to the nonlinear dynamics of learning in deep linear neural networks. International Conference on Learning Representations, 2014.
  • Issai Schur. Bemerkungen zur Theorie der beschränkten Bilinearformen mit unendlich vielen Veränderlichen. Journal für die reine und angewandte Mathematik (Crelles Journal), 1911(140):1–28, 1911.
  • Daniel Soudry, Elad Hoffer, Mor Shpigel Nacson, Suriya Gunasekar, and Nathan Srebro. The implicit bias of gradient descent on separable data. Journal of Machine Learning Research, 19(70), 2018.
  • Lili Su and Pengkun Yang. On learning over-parameterized neural networks: A functional approximation perspective. In Advances in Neural Information Processing Systems, pages 2637–2646, 2019.
  • Joel A Tropp. An introduction to matrix concentration inequalities. Foundations and Trends in Machine Learning, 8(1-2):1–230, 2015.
  • VN Vapnik and A Ya Chervonenkis. On the uniform convergence of relative frequencies of events to their probabilities. Theory of Probability & Its Applications, 16(2):264–280, 1971.
  • Martin J Wainwright. High-dimensional statistics: A non-asymptotic viewpoint, volume 48. Cambridge University Press, 2019.
  • Lechao Xiao, Yasaman Bahri, Jascha Sohl-Dickstein, Samuel Schoenholz, and Jeffrey Pennington. Dynamical isometry and a mean field theory of cnns: How to train 10,000-layer vanilla convolutional neural networks. In International Conference on Machine Learning, pages 5393–5402, 2018.
  • Zhi-Qin John Xu, Yaoyu Zhang, Tao Luo, Yanyang Xiao, and Zheng Ma. Frequency principle: Fourier analysis sheds light on deep neural networks. arXiv preprint arXiv:1901.06523, 2019a.
  • Springer, 2019b.
  • Zhiqin John Xu. Understanding training and generalization in deep learning by Fourier analysis. arXiv preprint arXiv:1808.04295, 2018.
  • Greg Yang. Scaling limits of wide neural networks with weight sharing: Gaussian process behavior, gradient independence, and neural tangent kernel derivation. arXiv preprint arXiv:1902.04760, 2019.
  • Greg Yang and Hadi Salman. A fine-grained spectral perspective on neural networks. arXiv preprint arXiv:1907.10599, 2019.
  • Chiyuan Zhang, Samy Bengio, Moritz Hardt, Benjamin Recht, and Oriol Vinyals. Understanding deep learning requires rethinking generalization. In Proceedings of the International Conference on Learning Representations (ICLR), 2017.
  • Yaoyu Zhang, Zhi-Qin John Xu, Tao Luo, and Zheng Ma. A type of generalization error induced by initialization in deep neural networks. arXiv preprint arXiv:1905.07777, 2019.