# Mean-field theory of two-layer neural networks: dimension-free bounds and kernel limit

COLT, pp. 2388-2464, 2019.


Keywords:

particle dynamics, nonlinear dynamics, local minimum, multi-layer networks

Abstract:

We consider learning two-layer neural networks using stochastic gradient descent (SGD). The mean-field description of this learning dynamics approximates the evolution of the network weights by an evolution in the space of probability distributions on $\mathbb{R}^D$ (where $D$ is the number of parameters associated with each neuron). This evolution can be…
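
As a toy illustration of this mean-field viewpoint (our own sketch, not the paper's code), the snippet below trains a small two-layer network with SGD and treats the $N$ neurons as particles in parameter space; the teacher network, activation, and hyperparameters are all illustrative choices.

```python
# Mean-field view of two-layer SGD: f(x) = (1/N) * sum_i a_i * tanh(<w_i, x>),
# with each neuron's parameters (a_i, w_i) evolving as a particle. Teacher,
# sizes, and learning rate are arbitrary illustrative choices.
import numpy as np

rng = np.random.default_rng(0)
d, N, steps, lr = 2, 200, 2000, 0.1

a = rng.normal(size=N)          # second-layer weights a_i
W = rng.normal(size=(N, d))     # first-layer weights w_i

def forward(x):
    # f(x; theta) = (1/N) sum_i a_i tanh(<w_i, x>)
    return np.mean(a * np.tanh(W @ x))

w_star = np.array([1.0, -1.0])  # hypothetical teacher: a single tanh neuron

def risk(n=500):
    xs = rng.normal(size=(n, d))
    return float(np.mean([(forward(x) - np.tanh(w_star @ x)) ** 2 for x in xs]))

risk_before = risk()
for _ in range(steps):
    x = rng.normal(size=d)
    err = forward(x) - np.tanh(w_star @ x)
    h = np.tanh(W @ x)
    # per-particle SGD step; the 1/N in f is absorbed into an O(1) step size,
    # which is the scaling under which the mean-field limit arises
    a -= lr * err * h
    W -= lr * err * (a * (1.0 - h ** 2))[:, None] * x[None, :]
risk_after = risk()
print(risk_before, risk_after)
```

Under this scaling, the empirical distribution of the $N$ particles is the object whose evolution the distributional dynamics describes.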

Introduction

- Multi-layer neural networks, and in particular multi-layer perceptrons, present a number of remarkable features.
- Mean-field theory has recently been used to prove global convergence guarantees for SGD in two-layer neural networks (Mei et al., 2018; Chizat and Bach, 2018a).
- Mei et al. (2018) prove quantitative bounds to approximate SGD by the mean-field dynamics.

Highlights

- Multi-layer neural networks, and in particular multi-layer perceptrons, present a number of remarkable features. They are effectively trained using stochastic gradient descent (SGD) (LeCun et al., 1998); their behavior is fairly insensitive to the number of hidden units or to the input dimension (Srivastava et al., 2014); and their number of parameters is often larger than the number of samples.
- We prove a new bound that is dimension-independent and more natural: keeping the evolution time $T = O(1)$, the new result requires $N \gg 1$ to obtain a vanishing approximation error (to make the approximation error vanish, $N$ should depend on the Lipschitz constants in the assumptions, which may implicitly depend on the dimension).
- For a suitable scaling of the initialization, the kernel and mean-field regimes appear at different time scales.
- We introduce the following metric on $\mathcal{P}(C([0,T];\mathbb{R}^D))$: $D_T(m^1, m^2) = \inf\big\{ \sup_{t \in [0,T]} \mathbb{E}_\gamma\big[\|\theta^1_t - \theta^2_t\|_2^2\big]^{1/2} : \gamma \text{ is a coupling of } m^1, m^2 \big\}$.
- We introduce the distributional dynamics and the residual dynamics, which we consider both in the pre-limit and in the limit of an infinite number of neurons.
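
The coupling-based metric $D_T$ can be probed empirically. The sketch below (our own construction, not from the paper) upper-bounds it for two bundles of sampled trajectories by searching over matching couplings; the trajectory data are synthetic random walks standing in for particle paths.

```python
# Upper-bounding the path-space distance D_T between two empirical measures over
# trajectories. Each matching (permutation) induces a valid coupling, so the
# minimum over matchings upper-bounds the infimum over all couplings.
import numpy as np

rng = np.random.default_rng(1)
N, n_times, D = 40, 100, 3

# Two bundles of N trajectories in R^D (e.g. two runs of a particle dynamics),
# standing in for samples from the path-space measures m^1 and m^2.
theta1 = np.cumsum(rng.normal(scale=0.1, size=(N, n_times, D)), axis=1)
theta2 = np.cumsum(rng.normal(scale=0.1, size=(N, n_times, D)), axis=1)

def coupling_cost(perm):
    # sup over t of the empirical E_gamma[||theta1_t - theta2_t||_2^2]^{1/2}
    # for the matching coupling gamma that pairs particle i with perm(i)
    diff = theta1 - theta2[perm]
    sq_dist = np.sum(diff ** 2, axis=-1)          # (N, n_times)
    return np.sqrt(np.max(np.mean(sq_dist, axis=0)))

# Random search over matching couplings; any of them upper-bounds D_T.
best = coupling_cost(np.arange(N))
for _ in range(200):
    best = min(best, coupling_cost(rng.permutation(N)))
print(best)
```

A full minimization over matchings (e.g. via the Hungarian algorithm) would tighten this bound, but any admissible coupling already suffices to upper-bound $D_T$.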

Results

- Theorem 4 (B) is the first quantitative bound approximating noisy SGD by the distributional dynamics in the case of unbounded coefficients.
- Bound between PDE and nonlinear dynamics: Proposition 13 (PDE–ND) shows there exists a constant $K$, depending only on the constants $K_i$, $i = 1, 2, 3$, such that with probability at least $1 - e^{-z^2}$ a bound holds uniformly over time.
- Bound between particle dynamics and GD: Proposition 18 (PD–GD) shows there exists a constant $K$ such that a bound holds uniformly over $k \in [0, t/\varepsilon] \cap \mathbb{N}$ and $i \le N$.
- Proposition 19 (GD–SGD): there exists a constant $K$ such that, with probability at least $1 - e^{-z^2}$, a bound holds uniformly over $k \in [0, T/\varepsilon] \cap \mathbb{N}$ and $i \in [N]$.
- Proposition 23 (PDE–ND): there exists a constant $K$ such that, with probability at least $1 - e^{-z^2}$, $\sup |R_N - R| \le K(1+T)^4 \tfrac{1}{\sqrt{N}}\big[\sqrt{\log(NT)} + z\big]$.
- Lemma 25 (Term I bound): there exists a constant $K$ such that $\sup |R_N - \mathbb{E} R_N| \le K(1+T)^4 [\cdots]$.
- Proposition 29 (GD–SGD): there exist constants $K$ and $K_0$ such that if $\varepsilon \le 1/[K_0(D + \log N + z^2)e^{K_0(1+T)^3}]$, then with probability at least $1 - e^{-z^2}$, for any $t \le T$ a bound holds uniformly over $k \in [0, t/\varepsilon] \cap \mathbb{N}$.
- Proof of Lemma 30: first consider a generic $D$-dimensional $K_2$-sub-Gaussian random vector $X$; the argument bounds the moment generating function $\mathbb{E}_X[\exp\{\mu \cdots\}]$.

Conclusion

- Taking the union bound over $i \in [N]$ gives: $\mathbb{P}\big(\max_{i \in [N]} \sup_{t \le T} \|W_i(t)\|_2 \ge u\big) \le (1 - 2\mu\tau T/D)^{-D/2} \exp\{-\mu u^2/2 + \log N\}$.
- The discrepancy $\Delta_i(t)$ is controlled by $\int_0^t \|\Theta^i_s\|_2\,ds + \Theta_\infty + W_\infty$, which gives, after applying Gronwall's inequality with the bounds of Lemma 30: $\mathbb{P}\big(\Delta_i(t) \le K e^{KT}[\log N + z]\big) \ge 1 - e^{-z^2}$.
- Bound between nonlinear dynamics and particle dynamics: Proposition 35 (ND–PD) shows there exists a constant $K$ such that, with probability at least $1 - e^{-z^2}$, a bound holds uniformly over $t \in [0, T]$ and $i \in [N]$.
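
Gronwall's inequality, used above to control $\Delta_i(t)$, can be illustrated numerically: the sketch below (our toy example, with arbitrary constants) integrates a trajectory satisfying $u'(t) \le a + b\,u(t)$ with $u(0) = 0$ and checks that it stays below the Gronwall bound $u(t) \le (a/b)(e^{bt} - 1)$.

```python
# Gronwall's inequality, numerically: any u with u(0) = 0 and u'(t) <= a + b*u(t)
# satisfies u(t) <= (a/b) * (exp(b*t) - 1). We integrate a trajectory whose
# growth rate is strictly below the extremal one and verify the bound pointwise.
import math

a, b, T, h = 0.5, 1.3, 2.0, 1e-3
u, ok = 0.0, True
for k in range(int(T / h)):
    # forward-Euler step for u'(t) = a + 0.9*b*u(t)  (<= a + b*u(t))
    u += h * (a + 0.9 * b * u)
    t = (k + 1) * h
    ok = ok and (u <= (a / b) * (math.exp(b * t) - 1.0) + 1e-9)
print(ok)
```

This is the same mechanism by which an integral inequality for $\Delta_i(t)$ turns into the $e^{KT}$ factors in the final high-probability bounds.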

Related work

- As mentioned above, classical approximation theory already uses (either implicitly or explicitly) the idea of lifting the class of $N$-neuron neural networks, cf. Eq. (1), to the infinite-dimensional space (5) parametrized by probability distributions $\rho$; see e.g. Cybenko (1989); Barron (1993); Bartlett (1998); Anthony and Bartlett (2009). This idea was exploited algorithmically, e.g. in Bengio et al. (2006); Nitanda and Suzuki (2017).

Only very recently was (stochastic) gradient descent proved to converge (for a large enough number of neurons) to the infinite-dimensional evolution (DD) (Mei et al., 2018; Rotskoff and Vanden-Eijnden, 2018; Sirignano and Spiliopoulos, 2018; Chizat and Bach, 2018a). In particular, Mei et al. (2018) prove quantitative bounds to approximate SGD by the mean-field dynamics. Our work is mainly motivated by the objective of obtaining a better scaling with dimension and allowing for unbounded second-layer coefficients.

Funding

- This work was partially supported by grants NSF DMS-1613091, CCF-1714305, IIS-1741162, and ONR N00014-18-1-2729.

References

- Zeyuan Allen-Zhu, Yuanzhi Li, and Zhao Song. A convergence theory for deep learning via overparameterization. arXiv:1811.03962, 2018.
- Martin Anthony and Peter L Bartlett. Neural network learning: Theoretical foundations. Cambridge University Press, 2009.
- Sanjeev Arora, Simon S Du, Wei Hu, Zhiyuan Li, and Ruosong Wang. Fine-grained analysis of optimization and generalization for overparameterized two-layer neural networks. arXiv:1901.08584, 2019.
- Andrew R Barron. Universal approximation bounds for superpositions of a sigmoidal function. IEEE Transactions on Information Theory, 39(3):930–945, 1993.
- Peter L Bartlett. The sample complexity of pattern classification with neural networks: the size of the weights is more important than the size of the network. IEEE Transactions on Information Theory, 44(2):525–536, 1998.
- Yoshua Bengio, Nicolas L Roux, Pascal Vincent, Olivier Delalleau, and Patrice Marcotte. Convex neural networks. In Advances in neural information processing systems, pages 123–130, 2006.
- Lenaic Chizat and Francis Bach. On the global convergence of gradient descent for overparameterized models using optimal transport. arXiv:1805.09545, 2018a.
- Lenaic Chizat and Francis Bach. A note on lazy training in supervised differentiable programming. arXiv:1812.07956, 2018b.
- George Cybenko. Approximation by superpositions of a sigmoidal function. Mathematics of Control, Signals and Systems, 2(4):303–314, 1989.
- Xialiang Dou and Tengyuan Liang. Training neural networks as learning data-adaptive kernels: Provable representation and approximation benefits. arXiv:1901.07114, 2019.
- Simon S Du, Jason D Lee, Haochuan Li, Liwei Wang, and Xiyu Zhai. Gradient descent finds global minima of deep neural networks. arXiv:1811.03804, 2018a.
- Simon S Du, Xiyu Zhai, Barnabas Poczos, and Aarti Singh. Gradient descent provably optimizes over-parameterized neural networks. arXiv:1810.02054, 2018b.
- Lawrence C. Evans. Partial Differential Equations. Springer, 2009.
- Mario Geiger, Arthur Jacot, Stefano Spigler, Franck Gabriel, Levent Sagun, Stephane d’Ascoli, Giulio Biroli, Clement Hongler, and Matthieu Wyart. Scaling description of generalization with number of parameters in deep learning. arXiv:1901.01608, 2019.
- Arthur Jacot, Franck Gabriel, and Clement Hongler. Neural tangent kernel: Convergence and generalization in neural networks. arXiv:1806.07572, 2018.
- Adel Javanmard, Marco Mondelli, and Andrea Montanari. Analysis of a two-layer neural network via displacement convexity. arXiv:1901.01375, 2019.
- Richard Jordan, David Kinderlehrer, and Felix Otto. The variational formulation of the Fokker–Planck equation. SIAM Journal on Mathematical Analysis, 29(1):1–17, 1998.
- Yann LeCun, Leon Bottou, Yoshua Bengio, and Patrick Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, 1998.
- Yuanzhi Li and Yingyu Liang. Learning overparameterized neural networks via stochastic gradient descent on structured data. In Advances in Neural Information Processing Systems, pages 8168– 8177, 2018.
- Song Mei, Andrea Montanari, and Phan-Minh Nguyen. A mean field view of the landscape of two-layer neural networks. Proceedings of the National Academy of Sciences, 2018. doi: 10.1073/pnas.1806579115. URL http://www.pnas.org/content/early/2018/07/26/1806579115.
- Atsushi Nitanda and Taiji Suzuki. Stochastic particle gradient descent for infinite ensembles. arXiv:1712.05438, 2017.
- Samet Oymak and Mahdi Soltanolkotabi. Overparameterized nonlinear learning: Gradient descent takes the shortest path? arXiv:1812.10004, 2018.
- Grant M Rotskoff and Eric Vanden-Eijnden. Neural networks as interacting particle systems: Asymptotic convexity of the loss landscape and universal scaling of the approximation error. arXiv:1805.00915, 2018.
- Justin Sirignano and Konstantinos Spiliopoulos. Mean field analysis of neural networks. arXiv:1805.01053, 2018.
- Nitish Srivastava, Geoffrey Hinton, Alex Krizhevsky, Ilya Sutskever, and Ruslan Salakhutdinov. Dropout: a simple way to prevent neural networks from overfitting. The Journal of Machine Learning Research, 15(1):1929–1958, 2014.
- Colin Wei, Jason D Lee, Qiang Liu, and Tengyu Ma. On the margin theory of feedforward neural networks. arXiv:1810.05369, 2018.
- Difan Zou, Yuan Cao, Dongruo Zhou, and Quanquan Gu. Stochastic gradient descent optimizes over-parameterized deep ReLU networks. arXiv:1811.08888, 2018.
