# Explaining Landscape Connectivity of Low-cost Solutions for Multilayer Nets

ADVANCES IN NEURAL INFORMATION PROCESSING SYSTEMS 32 (NIPS 2019), pp. 14574-14583, 2019.

Abstract:

Mode connectivity (Garipov et al., 2018; Draxler et al., 2018) is a surprising phenomenon in the loss landscape of deep nets. Optima, at least those discovered by gradient-based optimization, turn out to be connected by simple paths on which the loss function is almost constant. Often, these paths can be chosen to be piece-wise linear, with...

Introduction

- Efforts to understand how and why deep learning works have led to a focus on the optimization landscape of the training loss.
- Mode connectivity and spurious valleys: Fixing a neural network architecture, a data set D and a loss function, the authors say two sets of parameters/solutions θ_A and θ_B are ε-connected if there is a path that is continuous with respect to t and satisfies: 1.
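
The conditions on the path are cut off in the text above. The sketch below assumes the usual reading: two solutions are ε-connected along a path if the loss everywhere on that path is at most ε above the larger of the two endpoint losses. The helper names, the sampling-based check and the toy loss (whose minimizers form the unit circle) are all illustrative assumptions, not the authors' code.

```python
import numpy as np

def path_losses(loss_fn, waypoints, samples_per_segment=50):
    """Evaluate loss_fn at evenly spaced points on a piecewise linear path."""
    losses = []
    for start, end in zip(waypoints[:-1], waypoints[1:]):
        for t in np.linspace(0.0, 1.0, samples_per_segment):
            losses.append(loss_fn((1.0 - t) * start + t * end))
    return np.array(losses)

def is_eps_connected_along(loss_fn, waypoints, eps, samples_per_segment=50):
    # Assumed criterion: loss on the path <= max(endpoint losses) + eps.
    # Sampling can only approximate a statement about every point on the path.
    bound = max(loss_fn(waypoints[0]), loss_fn(waypoints[-1])) + eps
    worst = path_losses(loss_fn, waypoints, samples_per_segment).max()
    return bool(worst <= bound)

def toy_loss(theta):
    # Hypothetical non-convex loss: zero exactly on the unit circle.
    return float((np.dot(theta, theta) - 1.0) ** 2)

theta_a = np.array([1.0, 0.0])
theta_b = np.array([-1.0, 0.0])

# Straight segment through the origin versus a 16-segment path around the circle.
direct = [theta_a, theta_b]
around = [np.array([np.cos(a), np.sin(a)]) for a in np.linspace(0.0, np.pi, 17)]

print(is_eps_connected_along(toy_loss, direct, eps=0.01))
print(is_eps_connected_along(toy_loss, around, eps=0.01))
```

On this toy loss the straight segment crosses a high-loss region near the origin, while the piecewise linear path that hugs the circle keeps the loss nearly constant. This mirrors the abstract's observation that low-loss paths can often be chosen piecewise linear.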

Highlights

- Efforts to understand how and why deep learning works have led to a focus on the optimization landscape of the training loss
- Mode connectivity and spurious valleys: Fixing a neural network architecture, a data set D and a loss function, we say two sets of parameters/solutions θ_A and θ_B are ε-connected if there is a path that is continuous with respect to t and satisfies: 1
- In other words, letting C be the set of solutions that are ε-dropout stable, a ReLU network has the ε-mode connectivity property with respect to C
- Suppose there exists a network θ* with layer width h*_i for each layer i that achieves low loss [...]. Intuitively, we prove this theorem by connecting via the neural network with narrow hidden layers
- We demonstrate that our assumptions and theoretical findings accurately characterize mode connectivity in practical settings
- We demonstrate that the training loss and accuracy obtained via the path construction in Theorem 3 between two noise stable VGG-11 networks θ_A and θ_B remain fairly low and fairly high, respectively. By comparison, directly interpolating between the two networks incurs loss as high as 2.34 and accuracy as low as 10%, as shown in Section D.2

Results

- When the network only has the mode connectivity property with respect to a class of solutions C, then as long as C contains a global minimizer, there are no spurious valleys in C.
- In other words, letting C be the set of solutions that are ε-dropout stable, a ReLU network has the ε-mode connectivity property with respect to C.
- The authors use a very similar notion of noise stability, and show that all noise stable solutions can be connected as long as the network is sufficiently overparametrized.
- Note that the quantity ε is small as long as the hidden layer width h_min is large compared to the noise stable parameters.
- If θ_A and θ_B are two fully connected networks that are both ε-noise stable, there exists a path between them made of line segments in parameter space.
- Venturi et al. (2018) showed that spurious valleys can exist for 2-layer ReLU nets with an arbitrary number of hidden units, but again they do not extend their result to the overparametrized setting.
- Recall that the definition of dropout-stability merely requires the existence of a particular sub-network with half the width of the original that achieves low loss.
- As Theorem 3 suggests, if there exists a narrow network that achieves low loss, one need only be able to drop out a number of filters equal to the width of the narrow network to connect local minima.
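
The notion of dropout stability used above (a half-width sub-network that still achieves low loss) can be illustrated on a toy two-layer ReLU net. The sketch below makes several assumptions that are not stated in the text: the dropped hidden units are zeroed, the surviving output weights are scaled by 2, and the wide network is built by duplicating a narrow one so that the sub-network is exact. All function names are hypothetical.

```python
import numpy as np

def forward(w1, w2, x):
    """Two-layer ReLU network: x -> relu(x W1^T) W2^T."""
    return np.maximum(x @ w1.T, 0.0) @ w2.T

def mse(w1, w2, x, y):
    return float(np.mean((forward(w1, w2, x) - y) ** 2))

def half_width_subnetwork(w1, w2, keep):
    """Zero every hidden unit not in `keep`; scale kept output weights by 2 (assumed)."""
    mask = np.zeros(w1.shape[0])
    mask[keep] = 1.0
    return w1 * mask[:, None], 2.0 * w2 * mask[None, :]

rng = np.random.default_rng(0)
x = rng.normal(size=(64, 5))

# A narrow 4-unit network defines the targets, so it has zero loss by construction.
narrow_w1 = rng.normal(size=(4, 5))
narrow_w2 = rng.normal(size=(1, 4))
y = forward(narrow_w1, narrow_w2, x)

# Toy wide network: every narrow unit appears twice with half the output weight.
wide_w1 = np.vstack([narrow_w1, narrow_w1])
wide_w2 = np.hstack([narrow_w2, narrow_w2]) / 2.0

sub_w1, sub_w2 = half_width_subnetwork(wide_w1, wide_w2, keep=np.arange(4))
full_loss = mse(wide_w1, wide_w2, x, y)
sub_loss = mse(sub_w1, sub_w2, x, y)
print(full_loss, sub_loss)
```

In this constructed case the half-width sub-network reproduces the wide network's outputs, so both losses are essentially zero. A trained network would only satisfy this approximately, with the gap playing the role of ε.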

Conclusion

- The authors demonstrate that the VGG-11 (Simonyan and Zisserman, 2014) architecture, trained on CIFAR-10 with channel-wise dropout (Tompson et al., 2015; Keshari et al., 2018) with p = 0.25 at the first three layers and p = 0.5 at the others, converges to a noise-stable minimum, as measured by layer cushion, interlayer cushion, activation contraction and interlayer smoothness.
- In Figure 3, the authors demonstrate that the training loss and accuracy obtained via the path construction in Theorem 3 between two noise stable VGG-11 networks θ_A and θ_B remain fairly low and fairly high, respectively. By comparison, directly interpolating between the two networks incurs loss as high as 2.34 and accuracy as low as 10%, as shown in Section D.2.
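
Why direct interpolation can fail is easy to see on a toy example. The sketch below does not reproduce the VGG-11 experiment or its numbers. It assumes a hypothetical two-unit ReLU network and two parameter settings that compute the same function but differ by a permutation of the hidden units, then evaluates the loss along the straight line between them.

```python
import numpy as np

def forward(params, x):
    w1, w2 = params
    return np.maximum(x @ w1.T, 0.0) @ w2.T

def mse(params, x, y):
    return float(np.mean((forward(params, x) - y) ** 2))

def interpolate(params_a, params_b, t):
    """Point at fraction t on the straight line from params_a to params_b."""
    return tuple((1.0 - t) * a + t * b for a, b in zip(params_a, params_b))

# Regression target y = x, which relu(x) - relu(-x) represents exactly.
x = np.linspace(-1.0, 1.0, 21).reshape(-1, 1)
y = x.copy()

theta_a = (np.array([[1.0], [-1.0]]), np.array([[1.0, -1.0]]))
# Same function with the two hidden units swapped.
theta_b = (theta_a[0][::-1].copy(), theta_a[1][:, ::-1].copy())

ts = np.linspace(0.0, 1.0, 11)
losses = [mse(interpolate(theta_a, theta_b, t), x, y) for t in ts]
print(losses[0], max(losses), losses[-1])
```

Both endpoints fit the data exactly, yet the midpoint of the straight line is the all-zero network, whose loss is the mean of x² over the inputs. A path that avoids such regions, like the construction described above, is what keeps the loss low.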

Related work

- The landscape of the loss function for training neural networks has received a lot of attention. Dauphin et al. (2014) and Choromanska et al. (2015) conjectured that local minima of multi-layer neural networks have similar loss function values, and proved the result in idealized settings. For linear networks, it is known (Kawaguchi, 2016) that all local minima are also globally optimal.

Several theoretical works have explored whether a neural network has spurious valleys (non-global minima that are surrounded by other points with higher loss). Freeman and Bruna (2016) showed that for a two-layer net, if it is sufficiently overparametrized then all the local minimizers are (approximately) connected. However, in order to guarantee a small loss along the path, they need the number of neurons to be exponential in the number of input dimensions. Venturi et al. (2018) proved that if the number of neurons is larger than either the number of training samples or the intrinsic dimension (infinite for standard architectures), then the neural network cannot have spurious valleys. Liang et al. (2018) proved similar results for the binary classification setting. Nguyen et al. (2018) and Nguyen (2019) relaxed the requirement on overparametrization, but still require the output layer to have more direct connections than the number of training samples.

Funding

- Rong Ge acknowledges funding from NSF CCF-1704656, NSF CCF-1845171 (CAREER), the Sloan Fellowship and Google Faculty Research Award
- Sanjeev Arora acknowledges funding from the NSF, ONR, Simons Foundation, Schmidt Foundation, Amazon Research, DARPA and SRC

Reference

- Arora, S., Ge, R., Neyshabur, B., and Zhang, Y. (2018). Stronger generalization bounds for deep nets via a compression approach. arXiv preprint arXiv:1802.05296.
- Choromanska, A., Henaff, M., Mathieu, M., Arous, G. B., and LeCun, Y. (2015). The loss surfaces of multilayer networks. In Artificial Intelligence and Statistics, pages 192–204.
- Dauphin, Y. N., Pascanu, R., Gulcehre, C., Cho, K., Ganguli, S., and Bengio, Y. (2014). Identifying and attacking the saddle point problem in high-dimensional non-convex optimization. In Advances in neural information processing systems, pages 2933–2941.
- Draxler, F., Veschgini, K., Salmhofer, M., and Hamprecht, F. A. (2018). Essentially no barriers in neural network energy landscape. arXiv preprint arXiv:1803.00885.
- Freeman, C. D. and Bruna, J. (2016). Topology and geometry of half-rectified network optimization. arXiv preprint arXiv:1611.01540.
- Garipov, T., Izmailov, P., Podoprikhin, D., Vetrov, D. P., and Wilson, A. G. (2018). Loss surfaces, mode connectivity, and fast ensembling of DNNs. In Advances in Neural Information Processing Systems, pages 8789–8798.
- Kawaguchi, K. (2016). Deep learning without poor local minima. In Advances in neural information processing systems, pages 586–594.
- Keshari, R., Singh, R., and Vatsa, M. (2018). Guided dropout. arXiv preprint arXiv:1812.03965.
- Liang, S., Sun, R., Li, Y., and Srikant, R. (2018). Understanding the loss surface of neural networks for binary classification. In International Conference on Machine Learning, pages 2840–2849.
- Morcos, A. S., Barrett, D. G., Rabinowitz, N. C., and Botvinick, M. (2018). On the importance of single directions for generalization. arXiv preprint arXiv:1803.06959.
- Nguyen, Q. (2019). On connected sublevel sets in deep learning. arXiv preprint arXiv:1901.07417.
- Nguyen, Q., Mukkamala, M. C., and Hein, M. (2018). On the loss landscape of a class of deep neural networks with no bad local valleys. arXiv preprint arXiv:1809.10749.
- Safran, I. and Shamir, O. (2018). Spurious local minima are common in two-layer ReLU neural networks. In International Conference on Machine Learning, pages 4430–4438.
- Simonyan, K. and Zisserman, A. (2014). Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556.
- Tompson, J., Goroshin, R., Jain, A., LeCun, Y., and Bregler, C. (2015). Efficient object localization using convolutional networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 648–656.
- Tropp, J. A. (2012). User-friendly tail bounds for sums of random matrices. Foundations of computational mathematics, 12(4):389–434.
- Venturi, L., Bandeira, A. S., and Bruna, J. (2018). Spurious valleys in two-layer neural network optimization landscapes. arXiv preprint arXiv:1802.06384.
- Yun, C., Sra, S., and Jadbabaie, A. (2018). A critical view of global optimality in deep learning. arXiv preprint arXiv:1802.03487.
