# The Surprising Simplicity of the Early-Time Learning Dynamics of Neural Networks

NeurIPS 2020.

Abstract:

Modern neural networks are often regarded as complex black-box functions whose behavior is difficult to understand owing to their nonlinear dependence on the data and the nonconvexity in their loss landscapes. In this work, we show that these common perceptions can be completely false in the early phase of learning. In particular, we fo...

Introduction

- Modern deep learning models are enormously complex function approximators, with many state-of-the-art architectures employing millions or even billions of trainable parameters [Radford et al, 2019, Adiwardana et al, 2020].
- From the empirical perspective, practical models are flexible enough to perfectly fit the training data, even if the labels are pure noise [Zhang et al, 2017].
- These same high-capacity models generalize well when trained on real data, even without any explicit control of capacity.
- It has been suggested that training biases such models towards simpler functions, but the exact notion of simplicity and the mechanism by which it might be achieved remain poorly understood except in certain simplistic settings.

Highlights

- To reconcile theory with observation, it has been suggested that deep neural networks may enjoy some form of implicit regularization induced by gradient-based training algorithms that biases the trained models towards simpler functions.
- We formally prove that, for a class of well-behaved input distributions, the early-time learning dynamics of gradient descent on a two-layer fully-connected neural network with any common activation can be mimicked by training a simple linear model on the inputs.
- Our result formally proves that a neural network and a corresponding linear model make similar predictions in early time, providing a theoretical explanation of the empirical finding of Nakkiran et al [2019].
- While we mainly focused on two-layer fully-connected neural networks, we further provided theoretical and empirical evidence suggesting that this phenomenon continues to exist in more complicated models.
- Extending our result to those settings is a direction of future work. Another interesting direction is to study the dynamics of neural networks after the initial linear learning phase.
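The claim that early-time gradient descent on a two-layer network tracks a linear model can be explored on synthetic data. The following is a minimal sketch, not the authors' code: the Gaussian inputs, the ReLU activation, training only the first layer, the duplicated-neuron initialization, the learning rate, and the particular rescaled linear features (`zeta`, `nu`) are all assumptions made for illustration. It trains both models with full-batch gradient descent and records the root-mean-square gap between their predictions at each step.

```python
import numpy as np

def early_time_gap(n=256, d=32, m=1024, lr=5.0, steps=30, seed=0):
    """Train a two-layer ReLU net (first layer only) and a linear model with
    full-batch gradient descent; return both loss curves and the RMS gap
    between their predictions at every step."""
    rng = np.random.default_rng(seed)
    X = rng.standard_normal((n, d))            # synthetic inputs (assumption)
    y = np.sign(X @ rng.standard_normal(d))    # synthetic +/-1 labels (assumption)

    # Net: f(x) = (1/sqrt(m)) * sum_r v_r * relu(w_r . x / sqrt(d)).
    # Each neuron is duplicated with the opposite output sign, so f = 0 at init.
    W_half = rng.standard_normal((m // 2, d))
    W = np.vstack([W_half, W_half])
    v = np.concatenate([np.ones(m // 2), -np.ones(m // 2)])

    # Linear model on rescaled inputs plus a constant feature (assumed form).
    zeta = 0.5                       # E[relu'(g)] for g ~ N(0, 1)
    nu = 1.0 / np.sqrt(2.0 * np.pi)  # E[g * relu'(g)] for g ~ N(0, 1)
    Psi = np.hstack([zeta * X, nu * np.ones((n, 1))]) / np.sqrt(d)
    beta = np.zeros(d + 1)

    net_loss, lin_loss, gap = [], [], []
    for _ in range(steps + 1):
        pre = X @ W.T / np.sqrt(d)                      # (n, m) pre-activations
        f_net = np.maximum(pre, 0.0) @ v / np.sqrt(m)   # network predictions
        f_lin = Psi @ beta                              # linear-model predictions
        net_loss.append(0.5 * np.mean((f_net - y) ** 2))
        lin_loss.append(0.5 * np.mean((f_lin - y) ** 2))
        gap.append(np.sqrt(np.mean((f_net - f_lin) ** 2)))

        # Gradient of the squared loss with respect to the first layer only.
        r_net = (f_net - y) / n
        grad_W = ((r_net[:, None] * (pre > 0)) * v).T @ X / np.sqrt(m * d)
        W = W - lr * grad_W
        # Gradient step for the linear model.
        beta = beta - lr * (Psi.T @ (f_lin - y)) / n
    return np.array(net_loss), np.array(lin_loss), np.array(gap)

net_loss, lin_loss, gap = early_time_gap()
print(net_loss[-1], lin_loss[-1], gap[-1])
```

Comparing `gap` with the overall scale of the predictions over the first steps gives a rough sense of how long the two trajectories stay together in this toy setting.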

Results

- The authors perform experiments on a binary classification task from CIFAR-10 (“cats” vs “horses”) using a multi-layer FC network and a CNN.
- For a finer-grained examination of how the losses evolve, the authors decompose the residual of the predictions on test data (namely, $f_t(x) - y$ for all test data, collected as a vector in $\mathbb{R}^{2000}$) onto $V_{\mathrm{lin}}$, the space spanned by the inputs, and its orthogonal complement $V_{\mathrm{lin}}^{\perp}$.
- For both networks, the authors observe in Figure 3a that the test losses of the networks and the linear model are almost identical up to 1,000 steps, and the networks start to make progress in $V_{\mathrm{lin}}^{\perp}$ after that.
- The detailed setup and additional results for full-size CIFAR-10 and MNIST are deferred to Appendix A.
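The decomposition described above amounts to an orthogonal projection. A minimal sketch, assuming the inputs are stacked as rows of a matrix `X` with fewer columns than rows, and using synthetic stand-ins for the predictions and labels (the helper name `decompose_residual` is hypothetical):

```python
import numpy as np

def decompose_residual(X, residual):
    """Split `residual` (length n) into its component in the column space of
    X (shape n x d, the span of the inputs) and the orthogonal complement."""
    # The least-squares fit X @ coef is the orthogonal projection of
    # `residual` onto the span of the columns of X.
    coef, *_ = np.linalg.lstsq(X, residual, rcond=None)
    in_span = X @ coef
    out_of_span = residual - in_span
    return in_span, out_of_span

rng = np.random.default_rng(0)
n, d = 200, 10                                   # toy sizes, not the paper's
X = rng.standard_normal((n, d))                  # synthetic inputs
preds = X @ rng.standard_normal(d) + 0.1 * rng.standard_normal(n)  # stand-in for f_t(x)
y = rng.choice([-1.0, 1.0], size=n)              # synthetic labels

r_lin, r_perp = decompose_residual(X, preds - y)
# Mean squared residual carried by each subspace; the two parts add up to
# the total because the components are orthogonal.
print(np.sum(r_lin ** 2) / n, np.sum(r_perp ** 2) / n)
```

Tracking the two printed quantities over training steps would show which subspace the loss reduction comes from at each stage.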

Conclusion

- This work gave a novel theoretical result rigorously showing that gradient descent on a neural network learns a simple linear function in the early phase.
- Extending the result to more complicated models, beyond two-layer fully-connected networks, is a direction of future work.
- Another interesting direction is to study the dynamics of neural networks after the initial linear learning phase.

Related work

The early phase of neural network training has been the focus of considerable recent research. Frankle and Carbin [2019] found that sparse, trainable subnetworks – “lottery tickets” – emerge early in training. Achille et al [2017] showed the importance of early learning from the perspective of creating strong connections that are robust to corruption. Gur-Ari et al [2018] observed that after a short period of training, subsequent gradient updates span a low-dimensional subspace. Li et al [2019a], Lewkowycz et al [2020] showed that an initial large learning rate can benefit late-time generalization performance.

Implicit regularization of (stochastic) gradient descent has also been studied in various settings, suggesting a bias towards large-margin, low-norm, or low-rank solutions [Gunasekar et al, 2017, 2018, Soudry et al, 2018, Li et al, 2018, Ji and Telgarsky, 2019a,b, Arora et al, 2019a, Lyu and Li, 2019, Chizat and Bach, 2020, Razin and Cohen, 2020]. These results mostly aim to characterize the final solutions at convergence, while our focus is on the early-time learning dynamics. Another line of work has identified that deep linear networks gradually increase the rank during training [Arora et al, 2019a, Saxe et al, 2014, Lampinen and Ganguli, 2018, Gidel et al, 2019].

Reference

- Alessandro Achille, Matteo Rovere, and Stefano Soatto. Critical learning periods in deep neural networks. arXiv preprint arXiv:1711.08856, 2017.
- Daniel Adiwardana, Minh-Thang Luong, David R So, Jamie Hall, Noah Fiedel, Romal Thoppilan, Zi Yang, Apoorv Kulshreshtha, Gaurav Nemade, Yifeng Lu, et al. Towards a human-like open-domain chatbot. arXiv preprint arXiv:2001.09977, 2020.
- Sanjeev Arora, Nadav Cohen, Wei Hu, and Yuping Luo. Implicit regularization in deep matrix factorization. In Advances in Neural Information Processing Systems, pages 7411–7422, 2019a.
- Sanjeev Arora, Simon S Du, Wei Hu, Zhiyuan Li, Ruslan Salakhutdinov, and Ruosong Wang. On exact computation with an infinitely wide neural net. arXiv preprint arXiv:1904.11955, 2019b.
- Sanjeev Arora, Simon S Du, Wei Hu, Zhiyuan Li, and Ruosong Wang. Fine-grained analysis of optimization and generalization for overparameterized two-layer neural networks. arXiv preprint arXiv:1901.08584, 2019c.
- Yu Bai and Jason D. Lee. Beyond linearization: On quadratic and higher-order approximation of wide neural networks. In International Conference on Learning Representations, 2020. URL https://openreview.net/forum?id=rkllGyBFPH.
- Peter L Bartlett and Shahar Mendelson. Rademacher and Gaussian complexities: Risk bounds and structural results. Journal of Machine Learning Research, 3(Nov):463–482, 2002.
- Peter L Bartlett, Dylan J Foster, and Matus J Telgarsky. Spectrally-normalized margin bounds for neural networks. In Advances in Neural Information Processing Systems, pages 6241–6250, 2017.
- James Bradbury, Roy Frostig, Peter Hawkins, Matthew James Johnson, Chris Leary, Dougal Maclaurin, and Skye Wanderman-Milne. JAX: composable transformations of Python+NumPy programs, 2018. URL http://github.com/google/jax.
- Yuan Cao, Zhiying Fang, Yue Wu, Ding-Xuan Zhou, and Quanquan Gu. Towards understanding the spectral bias of deep learning. arXiv preprint arXiv:1912.01198, 2019.
- Lenaic Chizat and Francis Bach. Implicit bias of gradient descent for wide two-layer neural networks trained with the logistic loss. arXiv preprint arXiv:2002.04486, 2020.
- Lenaic Chizat, Edouard Oyallon, and Francis Bach. On lazy training in differentiable programming. In Advances in Neural Information Processing Systems, pages 2933–2943, 2019.
- Simon S Du, Wei Hu, Sham M Kakade, Jason D Lee, and Qi Lei. Few-shot learning via learning the representation, provably. arXiv preprint arXiv:2002.09434, 2020.
- Gintare Karolina Dziugaite and Daniel M Roy. Computing nonvacuous generalization bounds for deep (stochastic) neural networks with many more parameters than training data. arXiv preprint arXiv:1703.11008, 2017.
- Noureddine El Karoui. The spectrum of kernel random matrices. The Annals of Statistics, 38(1):1–50, 2010.
- Jonathan Frankle and Michael Carbin. The lottery ticket hypothesis: Finding sparse, trainable neural networks. In International Conference on Learning Representations, 2019. URL https://openreview.net/forum?id=rJl-b3RcF7.
- Gauthier Gidel, Francis Bach, and Simon Lacoste-Julien. Implicit regularization of discrete gradient dynamics in linear neural networks. In Advances in Neural Information Processing Systems, pages 3196–3206, 2019.
- Suriya Gunasekar, Blake E Woodworth, Srinadh Bhojanapalli, Behnam Neyshabur, and Nati Srebro. Implicit regularization in matrix factorization. In Advances in Neural Information Processing Systems, pages 6151–6159, 2017.
- Suriya Gunasekar, Jason Lee, Daniel Soudry, and Nathan Srebro. Implicit bias of gradient descent on linear convolutional networks. arXiv preprint arXiv:1806.00468, 2018.
- Guy Gur-Ari, Daniel A Roberts, and Ethan Dyer. Gradient descent happens in a tiny subspace. arXiv preprint arXiv:1812.04754, 2018.
- Wei Hu, Zhiyuan Li, and Dingli Yu. Simple and effective regularization methods for training on noisily labeled data with generalization guarantee. In International Conference on Learning Representations, 2020. URL https://openreview.net/forum?id=Hke3gyHYwH.
- Arthur Jacot, Franck Gabriel, and Clément Hongler. Neural tangent kernel: Convergence and generalization in neural networks. arXiv preprint arXiv:1806.07572, 2018.
- Ziwei Ji and Matus Telgarsky. The implicit bias of gradient descent on nonseparable data. In Conference on Learning Theory, pages 1772–1798, 2019a.
- Ziwei Ji and Matus Jan Telgarsky. Gradient descent aligns the layers of deep linear networks. In 7th International Conference on Learning Representations, ICLR 2019, 2019b.
- Yegor Klochkov and Nikita Zhivotovskiy. Uniform Hanson-Wright type concentration inequalities for unbounded entries via the entropy method. Electronic Journal of Probability, 25, 2020.
- Andrew K Lampinen and Surya Ganguli. An analytic theory of generalization dynamics and transfer learning in deep linear networks. arXiv preprint arXiv:1809.10374, 2018.
- Jaehoon Lee, Lechao Xiao, Samuel S Schoenholz, Yasaman Bahri, Jascha Sohl-Dickstein, and Jeffrey Pennington. Wide neural networks of any depth evolve as linear models under gradient descent. arXiv preprint arXiv:1902.06720, 2019.
- Aitor Lewkowycz, Yasaman Bahri, Ethan Dyer, Jascha Sohl-Dickstein, and Guy Gur-Ari. The large learning rate phase of deep learning: the catapult mechanism. arXiv preprint arXiv:2003.02218, 2020.
- Yuanzhi Li, Tengyu Ma, and Hongyang Zhang. Algorithmic regularization in over-parameterized matrix sensing and neural networks with quadratic activations. In Conference on Learning Theory, pages 2–47, 2018.
- Yuanzhi Li, Colin Wei, and Tengyu Ma. Towards explaining the regularization effect of initial large learning rate in training neural networks. In Advances in Neural Information Processing Systems, pages 11669–11680, 2019a.
- Zhiyuan Li, Ruosong Wang, Dingli Yu, Simon S Du, Wei Hu, Ruslan Salakhutdinov, and Sanjeev Arora. Enhanced convolutional neural tangent kernels. arXiv preprint arXiv:1911.00809, 2019b.
- Kaifeng Lyu and Jian Li. Gradient descent maximizes the margin of homogeneous neural networks. arXiv preprint arXiv:1906.05890, 2019.
- David A McAllester. PAC-Bayesian model averaging. In Proceedings of the Twelfth Annual Conference on Computational Learning Theory, pages 164–170, 1999.
- Mehryar Mohri, Afshin Rostamizadeh, and Ameet Talwalkar. Foundations of machine learning. MIT Press, 2012.
- Preetum Nakkiran, Gal Kaplun, Dimitris Kalimeris, Tristan Yang, Benjamin L Edelman, Fred Zhang, and Boaz Barak. SGD on neural networks learns functions of increasing complexity. arXiv preprint arXiv:1905.11604, 2019.
- Behnam Neyshabur, Srinadh Bhojanapalli, David McAllester, and Nathan Srebro. A PAC-Bayesian approach to spectrally-normalized margin bounds for neural networks. arXiv preprint arXiv:1707.09564, 2017a.
- Behnam Neyshabur, Srinadh Bhojanapalli, David McAllester, and Nati Srebro. Exploring generalization in deep learning. In Advances in Neural Information Processing Systems, pages 5947–5956, 2017b.
- Roman Novak, Lechao Xiao, Jiri Hron, Jaehoon Lee, Alexander A Alemi, Jascha Sohl-Dickstein, and Samuel S Schoenholz. Neural tangents: Fast and easy infinite neural networks in python. arXiv preprint arXiv:1912.02803, 2019.
- Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, and Ilya Sutskever. Language models are unsupervised multitask learners. OpenAI Blog, 1(8):9, 2019.
- Nasim Rahaman, Aristide Baratin, Devansh Arpit, Felix Draxler, Min Lin, Fred A Hamprecht, Yoshua Bengio, and Aaron Courville. On the spectral bias of neural networks. arXiv preprint arXiv:1806.08734, 2018.
- Noam Razin and Nadav Cohen. Implicit regularization in deep learning may not be explainable by norms. arXiv preprint arXiv:2005.06398, 2020.
- Mark Rudelson and Roman Vershynin. Hanson-Wright inequality and sub-Gaussian concentration. Electronic Communications in Probability, 18, 2013.
- Andrew M Saxe, James L McClelland, and Surya Ganguli. Exact solutions to the nonlinear dynamics of learning in deep linear neural networks. In International Conference on Learning Representations, 2014.
- Issai Schur. Bemerkungen zur Theorie der beschränkten Bilinearformen mit unendlich vielen Veränderlichen. Journal für die reine und angewandte Mathematik (Crelles Journal), 1911(140):1–28, 1911.
- Daniel Soudry, Elad Hoffer, Mor Shpigel Nacson, Suriya Gunasekar, and Nathan Srebro. The implicit bias of gradient descent on separable data. Journal of Machine Learning Research, 19(70), 2018.
- Lili Su and Pengkun Yang. On learning over-parameterized neural networks: A functional approximation perspective. In Advances in Neural Information Processing Systems, pages 2637–2646, 2019.
- Joel A Tropp. An introduction to matrix concentration inequalities. Foundations and Trends® in Machine Learning, 8(1-2):1–230, 2015.
- VN Vapnik and A Ya Chervonenkis. On the uniform convergence of relative frequencies of events to their probabilities. Theory of Probability & Its Applications, 16(2):264–280, 1971.
- Martin J Wainwright. High-dimensional statistics: A non-asymptotic viewpoint, volume 48. Cambridge University Press, 2019.
- Lechao Xiao, Yasaman Bahri, Jascha Sohl-Dickstein, Samuel Schoenholz, and Jeffrey Pennington. Dynamical isometry and a mean field theory of CNNs: How to train 10,000-layer vanilla convolutional neural networks. In International Conference on Machine Learning, pages 5393–5402, 2018.
- Zhi-Qin John Xu, Yaoyu Zhang, Tao Luo, Yanyang Xiao, and Zheng Ma. Frequency principle: Fourier analysis sheds light on deep neural networks. arXiv preprint arXiv:1901.06523, 2019a.
- Springer, 2019b.
- Zhiqin John Xu. Understanding training and generalization in deep learning by Fourier analysis. arXiv preprint arXiv:1808.04295, 2018.
- Greg Yang. Scaling limits of wide neural networks with weight sharing: Gaussian process behavior, gradient independence, and neural tangent kernel derivation. arXiv preprint arXiv:1902.04760, 2019.
- Greg Yang and Hadi Salman. A fine-grained spectral perspective on neural networks. arXiv preprint arXiv:1907.10599, 2019.
- Chiyuan Zhang, Samy Bengio, Moritz Hardt, Benjamin Recht, and Oriol Vinyals. Understanding deep learning requires rethinking generalization. In Proceedings of the International Conference on Learning Representations (ICLR), 2017.
- Yaoyu Zhang, Zhi-Qin John Xu, Tao Luo, and Zheng Ma. A type of generalization error induced by initialization in deep neural networks. arXiv preprint arXiv:1905.07777, 2019.
