Mean-field theory of two-layers neural networks: dimension-free bounds and kernel limit

    Song Mei, Theodor Misiakiewicz, Andrea Montanari

    COLT, pp. 2388-2464, 2019.

    Keywords:
    particle dynamics, nonlinear dynamics, local minimum, multi-layer neural networks
    TL;DR:
    We show that kernel ridge regression can be recovered as a special limit of the mean-field analysis.

    Abstract:

    We consider learning two-layer neural networks using stochastic gradient descent (SGD). The mean-field description of this learning dynamics approximates the evolution of the network weights by an evolution in the space of probability distributions on $R^D$ (where $D$ is the number of parameters associated with each neuron). This evolution can be...
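
    To fix ideas, here is a minimal sketch (not code from the paper) of the setting: a two-layer network with $N$ neurons in the mean-field $1/N$ scaling, trained by online SGD, where each neuron's parameter vector plays the role of a particle. The tanh activation, unit output weights, and toy data below are assumptions made purely for illustration.

        import numpy as np

        rng = np.random.default_rng(0)
        d, N = 20, 1000                      # input dimension, number of neurons
        step = 0.1                           # SGD step size

        theta = rng.normal(size=(N, d))      # one parameter vector ("particle") per neuron

        def forward(x, theta):
            # network output f(x; theta) = (1/N) * sum_i tanh(<theta_i, x>)
            return np.tanh(theta @ x).mean()

        def target(x):
            # toy data-generating function (illustration only)
            return np.tanh(x[:5].sum())

        for k in range(10_000):              # online (one-pass) SGD
            x = rng.normal(size=d)
            y = target(x)
            err = forward(x, theta) - y
            # Per-neuron gradient of the squared loss; the 1/N in forward() is
            # compensated by an N-times-larger learning rate, so each neuron moves
            # O(step) per iteration -- the scaling under which the mean-field limit arises.
            grad = err * (1.0 - np.tanh(theta @ x) ** 2)[:, None] * x[None, :]
            theta -= step * grad

        # The empirical distribution (1/N) * sum_i delta_{theta_i} of the particles is
        # the object whose large-N evolution the mean-field description tracks.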


    Introduction
    • Multi-layer neural networks, and in particular multi-layer perceptrons, present a number of remarkable features.
    • Mean-field theory has been used to prove global convergence guarantees for SGD in two-layer neural networks (Mei et al., 2018; Chizat and Bach, 2018a).
    • Mei et al. (2018) prove quantitative bounds for approximating SGD by the mean-field dynamics; a schematic form of this limiting dynamics is sketched below.
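    For orientation, the mean-field (distributional) dynamics has roughly the following form, up to constants, time rescaling, and the diffusion/regularization terms that appear in the noisy case; the precise potentials and assumptions are as in Mei et al. (2018):

        % schematic mean-field dynamics for the distribution \rho_t of neuron parameters;
        % \sigma_*(x; \theta) is the single-neuron response and (x, y) the data
        \partial_t \rho_t = \nabla_\theta \cdot \big( \rho_t \, \nabla_\theta \Psi(\theta; \rho_t) \big),
        \qquad
        \Psi(\theta; \rho) = V(\theta) + \int U(\theta, \tilde\theta)\, \rho(\mathrm{d}\tilde\theta),

        V(\theta) = -\mathbb{E}\big[ y \, \sigma_*(x; \theta) \big],
        \qquad
        U(\theta_1, \theta_2) = \mathbb{E}\big[ \sigma_*(x; \theta_1)\, \sigma_*(x; \theta_2) \big].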
    Highlights
    • Multi-layer neural networks, and in particular multi-layer perceptrons, present a number of remarkable features. They are effectively trained using stochastic gradient descent (SGD) (LeCun et al., 1998); their behavior is fairly insensitive to the number of hidden units or to the input dimension (Srivastava et al., 2014); and their number of parameters is often larger than the number of samples.
    • We prove a new bound that is dimension-independent and more natural: keeping the evolution time T = O(1), the new result only requires $N \gg 1$ in order to get a vanishing approximation error (to make the approximation error vanish, $N$ should be large compared to the Lipschitz constants in the assumptions, which may implicitly depend on the dimension).
    • For a suitable scaling of the initialization, the kernel and mean-field regimes appear at different time scales (see the sketch after this list).
    • We introduce the following metric on $\mathcal{P}(C([0, T]; R^D))$, the space of probability distributions over continuous paths: $D_T(m^1, m^2) = \inf \big\{ \mathbb{E}_\gamma \big[ \sup_{t \in [0,T]} \|\theta^1_t - \theta^2_t\|_2^2 \big] : \gamma \text{ is a coupling of } m^1, m^2 \big\}$.
    • We introduce the distributional dynamics and the residual dynamics, which we consider both at finite $N$ (pre-limit) and in the limit of an infinite number of neurons.
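    As an illustration of the kernel ("lazy") regime mentioned above, here is a minimal sketch (again, not the paper's construction): when the network is linearized around its initialization, ridge regression with the induced tangent kernel is the natural limit object, consistent with the kernel-ridge-regression limit discussed in the paper. The tanh units, toy data, and ridge parameter below are assumptions for illustration.

        import numpy as np

        rng = np.random.default_rng(1)
        d, N, n = 10, 500, 50
        X = rng.normal(size=(n, d))              # toy inputs (illustration only)
        y = np.tanh(X[:, 0])                     # toy targets

        theta0 = rng.normal(size=(N, d))         # random initialization

        def tangent_features(X, theta):
            # Gradient of f(x) = (1/N) * sum_i tanh(<theta_i, x>) with respect to the
            # theta_i, stacked: one block (1 - tanh^2(<theta_i, x>)) * x per neuron.
            S = np.tanh(X @ theta.T)             # shape (n, N)
            return np.einsum('ni,nd->nid', 1.0 - S ** 2, X).reshape(len(X), -1) / N

        Phi = tangent_features(X, theta0)        # (n, N*d) features of the linearized model
        K = Phi @ Phi.T                          # induced (finite-N tangent) kernel matrix
        lam = 1e-3                               # ridge parameter (assumption)

        # Kernel ridge regression with this kernel: in the lazy regime the trained
        # network's predictions stay close to this linear-in-the-kernel fit.
        alpha = np.linalg.solve(K + lam * np.eye(n), y)
        y_fit = K @ alpha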
    Results
    • Theorem 4 (B) is the first quantitative bound approximating noisy SGD by the distributional dynamics, for the case of unbounded coefficients.
    • Bound between PDE and nonlinear dynamics. Proposition 13 (PDE-ND): there exists a constant $K$, depending only on the $K_i$, $i = 1, 2, 3$, such that with probability at least $1 - e^{-z^2}$ we have $\sup_{t \le T} \ldots$ (how these intermediate bounds chain together is sketched after this list).
    • Bound between particle dynamics and GD. Proposition 18 (PD-GD): there exists a constant $K$ such that $\sup_{k \in [0, t/\varepsilon] \cap \mathbb{N}} \max_{i \le N} \ldots$
    • Proposition 19 (GD-SGD): there exists a constant $K$ such that with probability at least $1 - e^{-z^2}$ we have $\sup_{k \in [0, T/\varepsilon] \cap \mathbb{N}} \max_{i \in [N]} \ldots$
    • Proposition 23 (PDE-ND): there exists a constant $K$ such that with probability at least $1 - e^{-z^2}$ we have $\sup |R_N - R| \le K (1 + T)^4 \frac{1}{\sqrt{N}} \big[ \sqrt{\log(NT)} + z \big]$.
    • Lemma 25 (Term I bound): there exists a constant $K$ such that $\sup |R_N - \mathbb{E} R_N| \le K (1 + T)^4 \big[ \ldots$
    • Proposition 29 (GD-SGD): there exist constants $K$ and $K_0$ such that, if we take $\varepsilon \le 1 / \big[ K_0 (D + \log N + z^2) e^{K_0 (1 + T)^3} \big]$, the following holds with probability at least $1 - e^{-z^2}$: for any $t \le T$ we have $\sup_{k \in [0, t/\varepsilon] \cap \mathbb{N}} \ldots$
    • Proof of Lemma 30: first consider a generic $D$-dimensional $K^2$-sub-Gaussian random vector $X$; we have $\mathbb{E}_X \big[ \exp\{ \mu \ldots$
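    The propositions above control the links of a chain of intermediate dynamics. Schematically (the precise quantities compared at each link are as defined in the paper), the distance between noisy SGD and the mean-field PDE is bounded by a triangle inequality of the form

        % schematic error decomposition; SGD = stochastic gradient descent,
        % GD = gradient descent, PD = particle dynamics, ND = nonlinear dynamics,
        % PDE = distributional (mean-field) dynamics
        \mathrm{dist}(\mathrm{SGD}, \mathrm{PDE})
          \;\le\; \underbrace{\mathrm{dist}(\mathrm{SGD}, \mathrm{GD})}_{\text{GD-SGD}}
          \;+\; \underbrace{\mathrm{dist}(\mathrm{GD}, \mathrm{PD})}_{\text{PD-GD}}
          \;+\; \underbrace{\mathrm{dist}(\mathrm{PD}, \mathrm{ND})}_{\text{ND-PD}}
          \;+\; \underbrace{\mathrm{dist}(\mathrm{ND}, \mathrm{PDE})}_{\text{PDE-ND}},

    and each term is handled by one of the propositions listed above.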
    Conclusion
    • Taking the union bound over $i \in [N]$ gives: $\mathbb{P}\big( \max_{i \in [N]} \sup_{t \le T} \|W^i(t)\|_2 \ge u \big) \le (1 - 2\mu\tau T/D)^{-D/2} \exp\{ -\mu u^2 / 2 + \log N \}$.
    • $\ldots \|\Theta^i_s\|_2 \, \mathrm{d}s + \|\Theta\|_\infty + \|W\|_\infty$, which gives, after applying Gronwall's inequality (recalled after this list) with the bounds of Lemma 30: $\mathbb{P}\big( \Delta^i(t) \le K e^{KT} \big[ \sqrt{\log N} + z \big] \big) \ge 1 - e^{-z^2}$.
    • Bound between nonlinear dynamics and particle dynamics. Proposition 35 (ND-PD): there exists a constant $K$ such that with probability at least $1 - e^{-z^2}$ we have $\sup_{t \in [0,T]} \max_{i \in [N]} \ldots$
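    For reference, Gronwall's inequality in the integral form used in arguments like the one above (a standard statement, not quoted from the paper; here $A$ stands for whatever high-probability quantity, such as the bounds from Lemma 30, controls the non-recursive terms):

        % Gronwall's lemma (integral form)
        \text{If } \Delta(t) \le A + K \int_0^t \Delta(s)\,\mathrm{d}s \ \text{ for all } t \in [0, T],
        \quad \text{then} \quad
        \Delta(t) \le A \, e^{K t} \ \text{ for all } t \in [0, T].

    This is where the $e^{KT}$ factor in the displayed bound comes from.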
    Funding
    • This work was partially supported by grants NSF DMS-1613091, CCF-1714305, IIS-1741162, and ONR N00014-18-1-2729
    References
    • Zeyuan Allen-Zhu, Yuanzhi Li, and Zhao Song. A convergence theory for deep learning via overparameterization. arXiv:1811.03962, 2018.
    • Martin Anthony and Peter L. Bartlett. Neural Network Learning: Theoretical Foundations. Cambridge University Press, 2009.
    • Sanjeev Arora, Simon S. Du, Wei Hu, Zhiyuan Li, and Ruosong Wang. Fine-grained analysis of optimization and generalization for overparameterized two-layer neural networks. arXiv:1901.08584, 2019.
    • Andrew R. Barron. Universal approximation bounds for superpositions of a sigmoidal function. IEEE Transactions on Information Theory, 39(3):930–945, 1993.
    • Peter L. Bartlett. The sample complexity of pattern classification with neural networks: the size of the weights is more important than the size of the network. IEEE Transactions on Information Theory, 44(2):525–536, 1998.
    • Yoshua Bengio, Nicolas L. Roux, Pascal Vincent, Olivier Delalleau, and Patrice Marcotte. Convex neural networks. In Advances in Neural Information Processing Systems, pages 123–130, 2006.
    • Lenaic Chizat and Francis Bach. On the global convergence of gradient descent for overparameterized models using optimal transport. arXiv:1805.09545, 2018a.
    • Lenaic Chizat and Francis Bach. A note on lazy training in supervised differentiable programming. arXiv:1812.07956, 2018b.
    • George Cybenko. Approximation by superpositions of a sigmoidal function. Mathematics of Control, Signals and Systems, 2(4):303–314, 1989.
    • Xialiang Dou and Tengyuan Liang. Training neural networks as learning data-adaptive kernels: Provable representation and approximation benefits. arXiv:1901.07114, 2019.
    • Simon S. Du, Jason D. Lee, Haochuan Li, Liwei Wang, and Xiyu Zhai. Gradient descent finds global minima of deep neural networks. arXiv:1811.03804, 2018a.
    • Simon S. Du, Xiyu Zhai, Barnabas Poczos, and Aarti Singh. Gradient descent provably optimizes over-parameterized neural networks. arXiv:1810.02054, 2018b.
    • Lawrence C. Evans. Partial Differential Equations. Springer, 2009.
    • Mario Geiger, Arthur Jacot, Stefano Spigler, Franck Gabriel, Levent Sagun, Stephane d'Ascoli, Giulio Biroli, Clement Hongler, and Matthieu Wyart. Scaling description of generalization with number of parameters in deep learning. arXiv:1901.01608, 2019.
    • Arthur Jacot, Franck Gabriel, and Clement Hongler. Neural tangent kernel: Convergence and generalization in neural networks. arXiv:1806.07572, 2018.
    • Adel Javanmard, Marco Mondelli, and Andrea Montanari. Analysis of a two-layer neural network via displacement convexity. arXiv:1901.01375, 2019.
    • Richard Jordan, David Kinderlehrer, and Felix Otto. The variational formulation of the Fokker–Planck equation. SIAM Journal on Mathematical Analysis, 29(1):1–17, 1998.
    • Yann LeCun, Leon Bottou, Yoshua Bengio, and Patrick Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, 1998.
    • Yuanzhi Li and Yingyu Liang. Learning overparameterized neural networks via stochastic gradient descent on structured data. In Advances in Neural Information Processing Systems, pages 8168–8177, 2018.
    • Song Mei, Andrea Montanari, and Phan-Minh Nguyen. A mean field view of the landscape of two-layer neural networks. Proceedings of the National Academy of Sciences, 2018. doi: 10.1073/pnas.1806579115. URL http://www.pnas.org/content/early/2018/07/26/1806579115.
    • Atsushi Nitanda and Taiji Suzuki. Stochastic particle gradient descent for infinite ensembles. arXiv:1712.05438, 2017.
    • Samet Oymak and Mahdi Soltanolkotabi. Overparameterized nonlinear learning: Gradient descent takes the shortest path? arXiv:1812.10004, 2018.
    • Grant M. Rotskoff and Eric Vanden-Eijnden. Neural networks as interacting particle systems: Asymptotic convexity of the loss landscape and universal scaling of the approximation error. arXiv:1805.00915, 2018.
    • Filippo Santambrogio. Optimal Transport for Applied Mathematicians. Birkhäuser, 2015.
    • Justin Sirignano and Konstantinos Spiliopoulos. Mean field analysis of neural networks. arXiv:1805.01053, 2018.
    • Nitish Srivastava, Geoffrey Hinton, Alex Krizhevsky, Ilya Sutskever, and Ruslan Salakhutdinov. Dropout: a simple way to prevent neural networks from overfitting. The Journal of Machine Learning Research, 15(1):1929–1958, 2014.
    • Alain-Sol Sznitman. Topics in propagation of chaos. In École d'Été de Probabilités de Saint-Flour XIX — 1989, Springer, 1991.
    • Colin Wei, Jason D. Lee, Qiang Liu, and Tengyu Ma. On the margin theory of feedforward neural networks. arXiv:1810.05369, 2018.
    • Difan Zou, Yuan Cao, Dongruo Zhou, and Quanquan Gu. Stochastic gradient descent optimizes over-parameterized deep ReLU networks. arXiv:1811.08888, 2018.