Provable Benefit of Orthogonal Initialization in Optimizing Deep Linear Networks

ICLR, 2020.

Abstract:

The selection of initial parameter values for gradient-based optimization of deep neural networks is one of the most impactful hyperparameter choices in deep learning systems, affecting both convergence times and model performance. Yet despite significant empirical and theoretical analysis, relatively little has been proved about the conc...

Introduction
  • Through their myriad successful applications across a wide range of disciplines, it is well established that deep neural networks possess an unprecedented ability to model complex real-world datasets, and in many cases they can do so with minimal overfitting.
  • The list of practical achievements of deep learning has grown at an astonishing rate, and includes models capable of human-level performance in tasks such as image recognition (Krizhevsky et al., 2012), speech recognition (Hinton et al., 2012), and machine translation (Wu et al., 2016).
  • Given a candidate network architecture, some of the most impactful hyperparameters are those governing the choice of the model’s initial weights.
  • While considerable study has been devoted to the selection of initial weights, relatively little has been proved about how these choices affect important quantities such as the rate of convergence of gradient descent.
Highlights
  • Through their myriad successful applications across a wide range of disciplines, it is now well established that deep neural networks possess an unprecedented ability to model complex real-world datasets, and in many cases they can do so with minimal overfitting
  • We show that for deep networks, the width needed for efficient convergence with orthogonal initialization is independent of the depth, whereas the width needed for efficient convergence with Gaussian initialization scales linearly in the depth
  • We studied the effect of the initial parameter values of deep linear neural networks on the convergence time of gradient descent
  • We found that when the initial weights are iid Gaussian, the convergence time grows exponentially in the depth unless the width is at least as large as the depth
  • When the initial weight matrices are drawn from the orthogonal group, the width needed to guarantee efficient convergence is independent of the depth
  • These results establish for the first time a concrete proof that orthogonal initialization is superior to Gaussian initialization in terms of convergence time; a minimal sketch of the two initialization schemes follows below
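
The contrast above can be made concrete with a small sketch. The following NumPy snippet is an illustration only, not the authors' code: the 1/width Gaussian variance and the QR-based draw from the orthogonal group are standard choices assumed here, and the depth and width values are arbitrary.

```python
# Minimal sketch (not the authors' code) of the two initialization schemes
# being compared: i.i.d. Gaussian entries versus Haar-random orthogonal layers.
import numpy as np

def gaussian_init(depth, width, rng):
    """i.i.d. Gaussian initialization with variance 1/width per entry."""
    return [rng.normal(0.0, 1.0 / np.sqrt(width), size=(width, width))
            for _ in range(depth)]

def orthogonal_init(depth, width, rng):
    """Each layer drawn from the orthogonal group via QR of a Gaussian matrix."""
    layers = []
    for _ in range(depth):
        q, r = np.linalg.qr(rng.normal(size=(width, width)))
        q *= np.sign(np.diag(r))  # sign fix so the draw is Haar-distributed
        layers.append(q)
    return layers

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    depth, width = 50, 50
    for name, init in [("gaussian", gaussian_init), ("orthogonal", orthogonal_init)]:
        Ws = init(depth, width, rng)
        # End-to-end map W_L ... W_1; its conditioning reflects how well
        # signals and gradients propagate through the product at initialization.
        prod = np.linalg.multi_dot(Ws[::-1]) if depth > 1 else Ws[0]
        s = np.linalg.svd(prod, compute_uv=False)
        print(f"{name:10s} max/min singular value: {s.max():.2e} / {s.min():.2e}")
```

Since a product of orthogonal matrices is itself orthogonal, the end-to-end map under orthogonal initialization has all singular values equal to one, whereas the Gaussian product becomes badly conditioned once the depth is comparable to the width.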
Methods
  • We provide empirical evidence to support the results in Sections 4 and 5.
  • Each network is trained using gradient descent starting from both Gaussian and orthogonal initializations.
  • For Gaussian initialization, the transition between efficient and inefficient convergence occurs across a contour characterized by a linear relation between width and depth.
  • For orthogonal initialization, the transition occurs at a width that is approximately independent of the depth.
  • These observations are in excellent agreement with the theory developed in Sections 4 and 5; a sketch of such a width/depth sweep follows below.
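
The sweep described above can be summarized with the following sketch. It is illustrative only, not the paper's experimental setup: the random linear-regression data, learning rate, tolerance, sample count, and step budget are assumptions made for the example. `gaussian_init` and `orthogonal_init` refer to the functions in the earlier sketch.

```python
# Illustrative width/depth sweep (not the paper's experiments): count the
# gradient-descent steps until the training loss drops below a threshold.
import numpy as np

def train_steps(depth, width, init, lr=0.01, tol=1e-3, max_steps=10000, seed=0):
    """Train y = W_depth ... W_1 x on a random linear-regression target with
    full-batch gradient descent; return the number of steps to reach loss <= tol."""
    rng = np.random.default_rng(seed)
    n = 100                                     # number of training samples (assumed)
    X = rng.normal(size=(width, n))             # inputs; input dim = width for simplicity
    Y = (rng.normal(size=(width, width)) / np.sqrt(width)) @ X  # linear target
    Ws = init(depth, width, rng)                # gaussian_init or orthogonal_init
    for step in range(max_steps):
        acts = [X]                              # forward pass, caching each layer's input
        for W in Ws:
            acts.append(W @ acts[-1])
        resid = acts[-1] - Y
        loss = 0.5 * np.sum(resid ** 2) / n
        if not np.isfinite(loss):               # diverged (e.g. exploding product)
            break
        if loss <= tol:
            return step
        delta = resid / n                       # gradient of the loss w.r.t. the output
        for i in range(depth - 1, -1, -1):      # backward pass through the product
            grad_W = delta @ acts[i].T
            delta = Ws[i].T @ delta             # propagate with the pre-update W_i
            Ws[i] -= lr * grad_W
    return max_steps                            # did not converge within the budget

# Example: compare the two initializations as the depth grows at fixed width.
for d in (5, 20, 50):
    print(d, train_steps(d, width=32, init=gaussian_init),
             train_steps(d, width=32, init=orthogonal_init))
```

In line with the theory summarized above, one would expect the Gaussian runs to stall or diverge once the depth substantially exceeds the width, while the orthogonal runs converge at a rate essentially independent of depth.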
Conclusion
  • We studied the effect of the initial parameter values of deep linear neural networks on the convergence time of gradient descent.
  • We found that when the initial weights are iid Gaussian, the convergence time grows exponentially in the depth unless the width is at least as large as the depth.
  • When the initial weight matrices are drawn from the orthogonal group, the width needed to guarantee efficient convergence is independent of the depth.
  • These results establish for the first time a concrete proof that orthogonal initialization is superior to Gaussian initialization in terms of convergence time.
Related work
  • Deep linear networks. Despite the simplicity of their input-output maps, deep linear networks define high-dimensional non-convex optimization landscapes whose properties closely reflect those of their non-linear counterparts. For this reason, deep linear networks have been the subject of extensive theoretical analysis. A line of work (Kawaguchi, 2016; Hardt & Ma, 2016; Lu & Kawaguchi, 2017; Yun et al., 2017; Zhou & Liang, 2018; Laurent & von Brecht, 2018) studied the landscape properties of deep linear networks. Although it was established that all local minima are global under certain assumptions, these properties alone are still not sufficient to guarantee global convergence or to provide a concrete rate of convergence for gradient-based optimization algorithms. Another line of work directly analyzed the trajectory taken by gradient descent and established conditions that guarantee convergence to a global minimum (Bartlett et al., 2018; Arora et al., 2018; Du & Hu, 2019). Most relevant to our work is the result of Du & Hu (2019), which shows that if the width of the hidden layers is larger than the depth, gradient descent with Gaussian initialization can efficiently converge to a global minimum. Our result establishes that for Gaussian initialization, this linear dependence between width and depth is necessary, while for orthogonal initialization, the width can be independent of the depth. Our negative result for Gaussian initialization also significantly generalizes the result of Shamir (2018), who proved a similar negative result for 1-dimensional linear networks.
Funding
  • Shows that for deep networks, the width needed for efficient convergence to a global minimum with orthogonal initializations is independent of the depth, whereas the width needed for efficient convergence with Gaussian initializations scales linearly in the depth
  • Demonstrates how the benefits of a good initialization can persist throughout learning, suggesting an explanation for the recent empirical successes found by initializing very deep non-linear networks according to the principle of dynamical isometry
  • Examines the effect of initialization on the rate of convergence of gradient descent in deep linear networks
  • Provides for the first time a rigorous proof that drawing the initial weights from the orthogonal group speeds up convergence relative to the standard Gaussian initialization with iid weights
  • Presents our main positive result on efficient convergence from orthogonal initialization in Section 4
Reference
  • Madhu S. Advani and Andrew M. Saxe. High-dimensional dynamics of generalization error in neural networks. arXiv preprint arXiv:1710.03667, 2017.
  • Sanjeev Arora, Nadav Cohen, Noah Golowich, and Wei Hu. A convergence analysis of gradient descent for deep linear neural networks. arXiv preprint arXiv:1810.02281, 2018.
  • Peter Bartlett, Dave Helmbold, and Phil Long. Gradient descent with identity initialization efficiently learns positive definite linear transformations. In International Conference on Machine Learning, pp. 520–529, 2018.
  • Minmin Chen, Jeffrey Pennington, and Samuel S. Schoenholz. Dynamical isometry and a mean field theory of RNNs: Gating enables signal propagation in recurrent neural networks. arXiv preprint arXiv:1806.05394, 2018.
  • Simon Du and Wei Hu. Width provably matters in optimization for deep linear neural networks. In International Conference on Machine Learning, pp. 1655–1664, 2019.
  • Dar Gilboa, Bo Chang, Minmin Chen, Greg Yang, Samuel S. Schoenholz, Ed H. Chi, and Jeffrey Pennington. Dynamical isometry and a mean field theory of LSTMs and GRUs. arXiv preprint arXiv:1901.08987, 2019.
  • Eldad Haber and Lars Ruthotto. Stable architectures for deep neural networks. Inverse Problems, 34(1):014004, 2017.
  • Moritz Hardt and Tengyu Ma. Identity matters in deep learning. International Conference on Learning Representations, 2016.
  • Mikael Henaff, Arthur Szlam, and Yann LeCun. Recurrent orthogonal networks and long-memory tasks. arXiv preprint arXiv:1602.06662, 2016.
  • Geoffrey Hinton, Li Deng, Dong Yu, George E. Dahl, Abdel-rahman Mohamed, Navdeep Jaitly, Andrew Senior, Vincent Vanhoucke, Patrick Nguyen, Tara N. Sainath, et al. Deep neural networks for acoustic modeling in speech recognition: The shared views of four research groups. IEEE Signal Processing Magazine, 29(6):82–97, 2012.
  • Kenji Kawaguchi. Deep learning without poor local minima. In Advances in Neural Information Processing Systems, pp. 586–594, 2016.
  • Alex Krizhevsky, Ilya Sutskever, and Geoffrey E. Hinton. ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems, pp. 1097–1105, 2012.
  • Thomas Laurent and James von Brecht. A recurrent neural network without chaos. arXiv preprint arXiv:1612.06212, 2016.
  • Thomas Laurent and James von Brecht. Deep linear networks with arbitrary loss: All local minima are global. In International Conference on Machine Learning, pp. 2908–2913, 2018.
  • Quoc V. Le, Navdeep Jaitly, and Geoffrey E. Hinton. A simple way to initialize recurrent networks of rectified linear units. arXiv preprint arXiv:1504.00941, 2015.
  • Zenan Ling and Robert C. Qiu. Spectrum concentration in deep residual learning: A free probability approach. IEEE Access, 7:105212–105223, 2019.
  • Haihao Lu and Kenji Kawaguchi. Depth creates no bad local minima. arXiv preprint arXiv:1702.08580, 2017.
  • Zakaria Mhammedi, Andrew Hellicar, Ashfaqur Rahman, and James Bailey. Efficient orthogonal parametrisation of recurrent neural networks using Householder reflections. In Proceedings of the 34th International Conference on Machine Learning, pp. 2401–2409, 2017.
  • Jeffrey Pennington, Samuel Schoenholz, and Surya Ganguli. Resurrecting the sigmoid in deep learning through dynamical isometry: Theory and practice. In Advances in Neural Information Processing Systems, pp. 4785–4795, 2017.
  • Jeffrey Pennington, Samuel S. Schoenholz, and Surya Ganguli. The emergence of spectral universality in deep networks. arXiv preprint arXiv:1802.09979, 2018.
  • Feng Qi and Qiu-Ming Luo. Bounds for the ratio of two gamma functions—from Wendel's and related inequalities to logarithmically completely monotonic functions. Banach Journal of Mathematical Analysis, 6(2):132–158, 2012.
  • Andrew M. Saxe, James L. McClelland, and Surya Ganguli. Exact solutions to the nonlinear dynamics of learning in deep linear neural networks. International Conference on Learning Representations, 2014.
  • Ohad Shamir. Exponential convergence time of gradient descent for one-dimensional deep linear neural networks. arXiv preprint arXiv:1809.08587, 2018.
  • Wojciech Tarnowski, Piotr Warchoł, Stanisław Jastrzębski, Jacek Tabor, and Maciej Nowak. Dynamical isometry is achieved in residual networks in a universal way for any activation function. In The 22nd International Conference on Artificial Intelligence and Statistics, pp. 2221–2230, 2019.
  • Eugene Vorontsov, Chiheb Trabelsi, Samuel Kadoury, and Chris Pal. On orthogonality and learning recurrent networks with long term dependencies. In Proceedings of the 34th International Conference on Machine Learning, pp. 3570–3578, 2017.
  • Scott Wisdom, Thomas Powers, John Hershey, Jonathan Le Roux, and Les Atlas. Full-capacity unitary recurrent neural networks. In Advances in Neural Information Processing Systems, pp. 4880–4888, 2016.
  • Yonghui Wu, Mike Schuster, Zhifeng Chen, Quoc V. Le, Mohammad Norouzi, Wolfgang Macherey, Maxim Krikun, Yuan Cao, Qin Gao, Klaus Macherey, et al. Google's neural machine translation system: Bridging the gap between human and machine translation. arXiv preprint arXiv:1609.08144, 2016.
  • Lechao Xiao, Yasaman Bahri, Jascha Sohl-Dickstein, Samuel Schoenholz, and Jeffrey Pennington. Dynamical isometry and a mean field theory of CNNs: How to train 10,000-layer vanilla convolutional neural networks. In International Conference on Machine Learning, pp. 5389–5398, 2018.
  • Chulhee Yun, Suvrit Sra, and Ali Jadbabaie. Global optimality conditions for deep neural networks. arXiv preprint arXiv:1707.02444, 2017.
  • Yi Zhou and Yingbin Liang. Critical points of linear neural networks: Analytical forms and landscape properties. 2018.