# Pure and Spurious Critical Points: a Geometric Study of Linear Networks

ICLR, 2020.

Keywords:

Loss landscape, linear networks, algebraic geometry

Abstract:

The critical locus of the loss function of a neural network is determined by the geometry of the functional space and by the parameterization of this space by the network's weights. We introduce a natural distinction between pure critical points, which only depend on the functional space, and spurious critical points, which arise from the...

Introduction

- A fundamental goal in the theory of deep learning is to explain why the optimization of the nonconvex loss function of a neural network does not seem to be affected by the presence of nonglobal local minima.
- Many papers have addressed this issue by studying the landscape of the loss function (Baldi & Hornik, 1989; Choromanska et al., 2015; Kawaguchi, 2016; Venturi et al., 2018).
- These papers have shown that, in certain situations, any local minimum for the loss is always a global minimum.
- A complete understanding of the critical locus should be a prerequisite for investigating the dynamics of the optimization.

Highlights

- We prove that non-global local minima are necessarily pure critical points for convex losses. This means that many properties of the loss landscape can be read from the functional space, and it unifies many known results on the landscape of linear networks
- We explain that the absence of “bad” local minima in the loss landscape of linear networks is due to two distinct phenomena and does not hold in general: it is true for arbitrary smooth convex losses in the case of architectures that can express all linear maps (“filling architectures”) and it holds for the quadratic loss when the functional space is a determinantal variety (“non-filling architectures”)
- We have introduced the notions of pure and spurious critical points as general tools for a geometric investigation of the landscape of neural networks
- We have focused on the landscape of linear networks

Results

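- We say that the map $\mu_d$ is filling if $r = \min\{d_0, d_h\}$, where $r = \min\{d_i\}$ is the smallest layer width, so $\mathcal{M}_r = \mathbb{R}^{d_h \times d_0}$.
- In this case, the functional space is smooth and convex.
- We say that the map $\mu_d$ is non-filling if $r < \min\{d_0, d_h\}$, so $\mathcal{M}_r \subsetneq \mathbb{R}^{d_h \times d_0}$ is a determinantal variety.
- In this case, the functional space is non-smooth and non-convex.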
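The filling / non-filling distinction can be checked numerically. The following is a minimal sketch, assuming the network is given by a tuple of layer widths `(d_0, ..., d_h)` and that the end-to-end map is the product of the weight matrices; the helper names (`is_filling`, `end_to_end`, `rank`) are hypothetical and chosen for illustration only. It also exhibits why a determinantal variety is non-convex: the midpoint of two rank-one matrices can have rank two.

```python
from fractions import Fraction
import random


def matmul(A, B):
    """Product of two matrices stored as lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]


def rank(M):
    """Exact rank via Gaussian elimination over the rationals."""
    rows = [[Fraction(x) for x in row] for row in M]
    rk = 0
    for c in range(len(rows[0])):
        pivot = next((i for i in range(rk, len(rows)) if rows[i][c] != 0), None)
        if pivot is None:
            continue
        rows[rk], rows[pivot] = rows[pivot], rows[rk]
        for i in range(rk + 1, len(rows)):
            f = rows[i][c] / rows[rk][c]
            rows[i] = [x - f * y for x, y in zip(rows[i], rows[rk])]
        rk += 1
    return rk


def is_filling(widths):
    """widths = (d_0, ..., d_h); filling iff the smallest width is d_0 or d_h."""
    return min(widths) == min(widths[0], widths[-1])


def end_to_end(widths, rng):
    """Random integer weights W_i of shape d_i x d_{i-1}; returns W_h ... W_1."""
    product = None
    for d_in, d_out in zip(widths, widths[1:]):
        W = [[rng.randint(-3, 3) for _ in range(d_in)] for _ in range(d_out)]
        product = W if product is None else matmul(W, product)
    return product


rng = random.Random(0)
bottleneck = end_to_end((3, 1, 3), rng)  # non-filling: rank is at most 1

# Two rank-one matrices whose midpoint has rank two (non-convexity of M_1).
A = [[1, 0, 0], [0, 0, 0], [0, 0, 0]]
B = [[0, 0, 0], [0, 1, 0], [0, 0, 0]]
midpoint = [[Fraction(a + b, 2) for a, b in zip(ra, rb)] for ra, rb in zip(A, B)]
```

Here `is_filling((3, 4, 3))` holds while `is_filling((3, 1, 3))` does not, and `rank(midpoint)` exceeds the rank of both `A` and `B`.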

Conclusion

- We have introduced the notions of pure and spurious critical points as general tools for a geometric investigation of the landscape of neural networks.
- They provide a basic language for describing the interplay between a convex loss function and an overparameterized, non-convex functional space.
- We have observed that even in this simple framework global minima can have many disconnected components
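A toy case may help fix the terminology. The sketch below assumes the smallest possible linear network, with scalar weights `a` and `b`, end-to-end map `mu(a, b) = a * b`, and quadratic loss `ell(w) = (w - 1)^2`; this example and the function names are illustrative assumptions, not taken from the paper. A critical point of the composed loss is labelled pure when `mu(a, b)` is already critical for `ell` on the functional space (here all of R), and spurious otherwise.

```python
def mu(a, b):
    """End-to-end map of a two-layer scalar linear network."""
    return a * b


def d_ell(w):
    """Derivative of the quadratic loss ell(w) = (w - 1)^2 in function space."""
    return 2.0 * (w - 1.0)


def grad_L(a, b):
    """Gradient of L = ell(mu(a, b)) in parameter space, by the chain rule."""
    g = d_ell(mu(a, b))
    return (g * b, g * a)


def classify(a, b, tol=1e-12):
    """Label a parameter as 'not critical', 'pure', or 'spurious'."""
    ga, gb = grad_L(a, b)
    if abs(ga) > tol or abs(gb) > tol:
        return "not critical"
    # Critical in parameter space: pure iff also critical in function space.
    return "pure" if abs(d_ell(mu(a, b))) <= tol else "spurious"
```

In this sketch `classify(0.0, 0.0)` gives `"spurious"`: the gradient vanishes only because the differential of `mu` is zero at the origin, which is a saddle of the composed loss. Both `classify(1.0, 1.0)` and `classify(-1.0, -1.0)` give `"pure"`. These two global minima lie on different branches of the hyperbola `a * b = 1`, so the set of global minima is already disconnected in this example.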

Summary

## Objectives:

The goal of this paper is to revisit the loss function of neural networks from a geometric perspective, focusing on the relationship between the functional space of the network and its parameterization.

- Table 1: Bad local minima in loss landscapes for linear networks. Quadratic loss: no bad minima for filling architectures; no bad minima for non-filling architectures.
- Table 2: Number of critical points (columns) and number of minima (rows) in our experiments.

Related work

- Baldi & Hornik (1989) first proved the absence of non-global (“bad”) local minima for linear networks with one hidden layer (autoencoders). Their result was generalized to deep linear networks by Kawaguchi (2016). Many papers have since studied the loss landscape of linear networks under different assumptions (Hardt & Ma, 2016; Yun et al., 2017; Zhou & Liang, 2017; Laurent & von Brecht, 2017; Lu & Kawaguchi, 2017; Zhang, 2019). In particular, Laurent & von Brecht (2017) showed that linear networks with “no bottlenecks” have no bad local minima for arbitrary smooth loss functions. Lu & Kawaguchi (2017) and Zhang (2019) argued that “depth does not create local minima”, meaning that the absence of bad local minima in deep linear networks is implied by the same property of shallow linear networks. Our study of pure and spurious critical points can be used as a framework for explaining all these results in a unified way. The optimization dynamics of linear networks are also an active area of research (Arora et al., 2018; 2019), and our analysis of the landscape in function space sets the stage for studying gradient dynamics on determinantal varieties, as in Bah et al. (2019). Our work is also closely related to objects of study in applied algebraic geometry, particularly determinantal varieties and ED discriminants (Draisma et al., 2013; Ottaviani et al., 2013). Finally, we mention other recent works that study neural networks using algebraic-geometric tools (Mehta et al., 2018; Kileel et al., 2019; Jaffali & Oeding, 2019).

Funding

- We are grateful to ICERM (NSF DMS-1439786 and the Simons Foundation grant 507536) for the hospitality during the academic year 2018/2019, when many ideas for this project were developed
- MT and JB were partially supported by the Alfred P
- KK was partially supported by the Knut and Alice Wallenbergs Foundation within their WASP AI/Math initiative

Study subjects and analysis

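As previously noted, the image of $\mu_d$ is $\mathcal{M}_r \subset \mathbb{R}^{d_h \times d_0}$, where $r = \min\{d_i\}$. In particular, we distinguish between two cases:

- We say that the map $\mu_d$ is filling if $r = \min\{d_0, d_h\}$, so $\mathcal{M}_r = \mathbb{R}^{d_h \times d_0}$.
- We say that the map $\mu_d$ is non-filling if $r < \min\{d_0, d_h\}$, so $\mathcal{M}_r \subsetneq \mathbb{R}^{d_h \times d_0}$ is a determinantal variety.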

Reference

- Shun-ichi Amari. Information Geometry and Its Applications, volume 194 of Applied Mathematical Sciences. Springer Japan, Tokyo, 2016. ISBN 978-4-431-55977-1 978-4-431-55978-8. doi: 10.1007/978-4-431-55978-8.
- Sanjeev Arora, Nadav Cohen, Noah Golowich, and Wei Hu. A convergence analysis of gradient descent for deep linear neural networks. arXiv preprint arXiv:1810.02281, 2018.
- Sanjeev Arora, Nadav Cohen, Wei Hu, and Yuping Luo. Implicit regularization in deep matrix factorization. arXiv preprint arXiv:1905.13655, 2019.
- Bubacarr Bah, Holger Rauhut, Ulrich Terstiege, and Michael Westdickenberg. Learning deep linear neural networks: Riemannian gradient flows and convergence to global minimizers. arXiv:1910.05505 [cs, math], November 2019.
- Pierre Baldi and Kurt Hornik. Neural networks and principal component analysis: Learning from examples without local minima. Neural Networks, 2(1):53–58, January 1989. ISSN 08936080. doi: 10.1016/0893-6080(89)90014-2.
- Lenaic Chizat and Francis Bach. On the Global Convergence of Gradient Descent for Overparameterized Models using Optimal Transport. arXiv:1805.09545 [cs, math, stat], May 2018.
- Anna Choromanska, Mikael Henaff, Michael Mathieu, Gerard Ben Arous, and Yann LeCun. The Loss Surfaces of Multilayer Networks. In Proceedings of the Eighteenth International Conference on Artificial Intelligence and Statistics, AISTATS 2015, San Diego, California, USA, May 9-12, 2015.
- Jan Draisma and Emil Horobet. The average number of critical rank-one approximations to a tensor. arXiv:1408.3507 [math], August 2014.
- Jan Draisma, Emil Horobet, Giorgio Ottaviani, Bernd Sturmfels, and Rekha R. Thomas. The Euclidean distance degree of an algebraic variety. arXiv:1309.0049 [math], August 2013.
- Timur Garipov, Pavel Izmailov, Dmitrii Podoprikhin, Dmitry Vetrov, and Andrew Gordon Wilson. Loss Surfaces, Mode Connectivity, and Fast Ensembling of DNNs. arXiv:1802.10026 [cs, stat], October 2018.
- Daniel R. Grayson and Michael E. Stillman. Macaulay2, a software system for research in algebraic geometry. Available at http://www.math.uiuc.edu/Macaulay2/, 2019.
- Moritz Hardt and Tengyu Ma. Identity Matters in Deep Learning. arXiv:1611.04231 [cs, stat], November 2016.
- Joe Harris. Algebraic Geometry: A First Course. Number 133 in Graduate Texts in Mathematics. Springer, New York, corr. 3rd print edition, 1995. ISBN 978-0-387-97716-4.
- Hamza Jaffali and Luke Oeding. Learning algebraic models of quantum entanglement. arXiv preprint arXiv:1908.10247, 2019.
- Kenji Kawaguchi. Deep Learning without Poor Local Minima. CoRR, abs/1605.07110, 2016.
- Joe Kileel, Matthew Trager, and Joan Bruna. On the expressive power of deep polynomial neural networks. arXiv preprint arXiv:1905.12207, 2019.
- Thomas Laurent and James von Brecht. Deep linear neural networks with arbitrary loss: All local minima are global. arXiv:1712.01473 [cs, stat], December 2017.
- John M. Lee. Introduction to Smooth Manifolds. Number 218 in Graduate Texts in Mathematics. Springer, New York, 2003. ISBN 978-0-387-95495-0 978-0-387-95448-6.
- Haihao Lu and Kenji Kawaguchi. Depth Creates No Bad Local Minima. arXiv:1702.08580 [cs, math, stat], February 2017.
- Dhagash Mehta, Tianran Chen, Tingting Tang, and Jonathan D. Hauenstein. The loss surface of deep linear networks viewed through the algebraic geometry lens. arXiv:1810.07716, 2018. URL http://arxiv.org/abs/1810.07716.
- Song Mei, Andrea Montanari, and Phan-Minh Nguyen. A Mean Field View of the Landscape of Two-Layers Neural Networks. arXiv:1804.06561 [cond-mat, stat], April 2018.
- Giorgio Ottaviani, Pierre-Jean Spaenlehauer, and Bernd Sturmfels. Exact Solutions in Structured Low-Rank Approximation. arXiv:1311.2376 [cs, math, stat], November 2013.
- Luca Venturi, Afonso S Bandeira, and Joan Bruna. Spurious valleys in two-layer neural network optimization landscapes. arXiv preprint arXiv:1802.06384, 2018.
- Chulhee Yun, Suvrit Sra, and Ali Jadbabaie. Global optimality conditions for deep neural networks. arXiv preprint arXiv:1707.02444, 2017.
- Chulhee Yun, Suvrit Sra, and Ali Jadbabaie. Small nonlinearities in activation functions create bad local minima in neural networks. arXiv preprint arXiv:1802.03487, 2018.
- Li Zhang. Depth creates no more spurious local minima. arXiv preprint arXiv:1901.09827, 2019.
- Yi Zhou and Yingbin Liang. Critical Points of Neural Networks: Analytical Forms and Landscape Properties. arXiv:1710.11205 [cs, stat], October 2017.
- $\mathrm{GL}(n, \mathbb{C})$) the ED degree of $\varphi(V)$ is the same; see Theorem 5.4 in Draisma et al. (2013). This quantity is known as the general ED degree of $V$. For instance, almost all linear coordinate changes will deform a circle into an ellipse, so the general ED degree of the circle is four.
- As in the case of a circle, the general ED degree of the determinantal variety $\mathcal{M}_r$ is not equal to the ED degree of $\mathcal{M}_r$. Furthermore, there is no known closed formula for the general ED degree of $\mathcal{M}_r$ involving only the parameters $d_x$, $d_y$ and $r$. In the special case of rank-one matrices, one can derive a closed expression from the Catanese-Trifogli formula (Theorem 7.8 in Draisma et al. (2013)): the general ED degree of $\mathcal{M}_1$ is
- This expression yields 39 for $d_x = d_y = 3$, as mentioned in Example 13. For general $r$, formulas for the general ED degree of $\mathcal{M}_r$ involving Chern and polar classes can be found in Ottaviani et al. (2013) and Draisma et al. (2013). A short algorithm to compute the general ED degree of $\mathcal{M}_r$ is given in Example 7.11 of Draisma et al. (2013); it uses a package for advanced intersection theory in the algebro-geometric software Macaulay2 (Grayson & Stillman, 2019).
- Experiment 1. In general, it is very difficult to describe the open regions in $\mathbb{R}^n$ that are separated by the ED discriminant of a variety $V \subset \mathbb{R}^n$. Finding the “typical” number of real critical points for the distance function $h_u$ restricted to $V$ requires computing the volumes of these open regions. In the current state of the art in real algebraic geometry, this is only possible for very particular varieties $V$. For these reasons, and to get more insight into the typical number of real critical points of determinantal varieties with a perturbed Euclidean distance, we performed computational experiments with Macaulay2 (Grayson & Stillman, 2019) in the situation of Example 13. We fixed the
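The gap between the ED degree and the general ED degree of rank-one matrices can be illustrated with a short computation. The displayed rank-one formula is not present in this text, so the sketch below rests on two assumptions that should be checked against Draisma et al. (2013): that the ED degree of $\mathcal{M}_r$ equals $\binom{\min(d_x, d_y)}{r}$ (one critical point per choice of $r$ singular triples, as in the Eckart-Young theorem), and that the omitted expression is the usual specialization of the Catanese-Trifogli formula to a product of two projective spaces. The function names are hypothetical.

```python
from fractions import Fraction
from math import comb, factorial


def ed_degree(dx, dy, r):
    """ED degree of rank-<=r matrices: choose r of the min(dx, dy) singular triples."""
    return comb(min(dx, dy), r)


def general_ed_degree_rank_one(dx, dy):
    """General ED degree of rank-one dx x dy matrices (assumed closed form).

    Sum over s = 0..n, with n = dx + dy - 2, of
      (-1)^s (2^(n+1-s) - 1) (n-s)!
        * sum_{i+j=s} C(dx,i) C(dy,j) / ((dx-1-i)! (dy-1-j)!),
    where terms with i > dx-1 or j > dy-1 are dropped.
    """
    n = dx + dy - 2
    total = Fraction(0)
    for s in range(n + 1):
        inner = Fraction(0)
        for i in range(s + 1):
            j = s - i
            if i > dx - 1 or j > dy - 1:
                continue
            inner += Fraction(
                comb(dx, i) * comb(dy, j),
                factorial(dx - 1 - i) * factorial(dy - 1 - j),
            )
        total += (-1) ** s * (2 ** (n + 1 - s) - 1) * factorial(n - s) * inner
    if total.denominator != 1:
        raise ValueError("expected an integer value")
    return int(total)
```

Under these assumptions, `general_ed_degree_rank_one(3, 3)` evaluates to 39, the value quoted above for $d_x = d_y = 3$, whereas `ed_degree(3, 3, 1)` is 3. This mirrors the circle example, where the count of critical points grows once the Euclidean distance is replaced by a generic quadratic form.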
