# Deep Residual Learning for Image Recognition

CVPR, 2016.

Abstract

Deeper neural networks are more difficult to train. We present a residual learning framework to ease the training of networks that are substantially deeper than those used previously. We explicitly reformulate the layers as learning residual functions with reference to the layer inputs, instead of learning unreferenced functions. We provide comprehensive empirical evidence showing that these residual networks are easier to optimize, and can gain accuracy from considerably increased depth.

Introduction

- Deep convolutional neural networks [22, 21] have led to a series of breakthroughs for image classification [21, 49, 39].
- The shortcut connections perform identity mapping, and their outputs are added to the outputs of the stacked layers (Fig. 2).
- On the ImageNet classification dataset [35], the authors obtain excellent results by extremely deep residual nets.
- The papers of [38, 37, 31, 46] propose methods for centering layer responses, gradients, and propagated errors, implemented by shortcut connections.
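The shortcut idea above can be sketched in a few lines: the stacked layers learn a residual function F(x), and the parameter-free identity shortcut adds the block input back, so the block outputs y = F(x) + x. This is a minimal NumPy sketch, not the authors' code; the fully connected weights `w1`, `w2` stand in for the paper's 3×3 convolutions.

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def residual_block(x, w1, w2):
    """Two weight layers with ReLU, plus an identity shortcut.

    w1, w2 are illustrative fully connected weights standing in for the
    3x3 convolutions in the paper; the shortcut adds no parameters.
    """
    f = relu(x @ w1) @ w2   # residual function F(x)
    return relu(f + x)      # identity shortcut: add the input, then ReLU

rng = np.random.default_rng(0)
x = rng.standard_normal(8)
w1 = rng.standard_normal((8, 8)) * 0.01
w2 = rng.standard_normal((8, 8)) * 0.01

y = residual_block(x, w1, w2)
# With near-zero weights F(x) is close to 0, so the block is close to
# the identity (here, relu(x)) -- the preconditioning argued for below:
print(np.allclose(y, relu(x), atol=0.01))
```

The usage note is the point: driving the residual weights toward zero recovers an (approximate) identity mapping, which is exactly how the reformulation eases optimization when identity is near-optimal.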

Highlights

- Deep convolutional neural networks [22, 21] have led to a series of breakthroughs for image classification [21, 49, 39]
- Deep networks naturally integrate low/mid/highlevel features [49] and classifiers in an end-to-end multilayer fashion, and the “levels” of features can be enriched by the number of stacked layers
- Recent evidence [40, 43] reveals that network depth is of crucial importance, and the leading results [40, 43, 12, 16] on the challenging ImageNet dataset [35] all exploit “very deep” [40] models, with a depth of sixteen [40] to thirty [16]
- We address the degradation problem by introducing a deep residual learning framework
- We show that: 1) Our extremely deep residual nets are easy to optimize, but the counterpart “plain” nets exhibit higher training error when the depth increases; 2) Our deep residual nets can enjoy accuracy gains from greatly increased depth, producing results substantially better than previous networks
- We argue that this is because the zero-padded dimensions in option A have no residual learning
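The "option A" shortcut referred to above stays parameter-free even when the feature map changes size: it subsamples spatially and zero-pads the extra channels, so the padded dimensions carry no residual signal. A toy sketch under those assumptions (function name and shapes are illustrative, not from the paper's code):

```python
import numpy as np

def shortcut_option_a(x, out_channels):
    """x: (channels, height, width). Subsample by stride 2 and zero-pad
    the new channels so the shortcut remains parameter-free."""
    c, h, w = x.shape
    sub = x[:, ::2, ::2]                     # match the stride-2 downsampling
    pad = np.zeros((out_channels - c, sub.shape[1], sub.shape[2]))
    return np.concatenate([sub, pad], axis=0)

x = np.ones((64, 8, 8))
y = shortcut_option_a(x, 128)
print(y.shape)       # (128, 4, 4)
print(y[64:].sum())  # 0.0 -- the padded channels carry nothing
```

Option B instead uses a 1×1 projection for the increasing dimensions, which adds parameters but gives the new channels a shortcut path as well.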

Results

- When a gated shortcut is “closed”, the layers in highway networks represent non-residual functions.
- As the authors discussed in the introduction, if the added layers can be constructed as identity mappings, a deeper model should have training error no greater than its shallower counterpart.
- The degradation problem suggests that the solvers might have difficulties in approximating identity mappings by multiple nonlinear layers.
- With the residual learning reformulation, if identity mappings are optimal, the solvers may drive the weights of the multiple nonlinear layers toward zero to approach identity mappings.
- The authors show by experiments (Fig. 7) that the learned residual functions in general have small responses, suggesting that identity mappings provide reasonable preconditioning.
- Based on the above plain network, the authors insert shortcut connections (Fig. 3, right) which turn the network into its counterpart residual version.
- The 34-layer plain net is still able to achieve competitive accuracy (Table 3), suggesting that the solver works to some extent.
- The baseline architectures are the same as the above plain nets, except that a shortcut connection is added to each pair of 3×3 filters as in Fig. 3.
- In the first comparison (Table 2 and Fig. 4 right), the authors use identity mapping for all shortcuts and zero-padding for increasing dimensions.
- The authors note that the 18-layer plain/residual nets are comparably accurate (Table 2), but the 18-layer ResNet converges faster (Fig. 4 right vs left).
- For each residual function F, the authors use a stack of 3 layers instead of 2 (Fig. 5).
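The 3-layer bottleneck in Fig. 5 (1×1 reduce, 3×3, 1×1 restore) keeps the cost close to the 2-layer 3×3 block it replaces. A quick parameter count makes this concrete (weights only, ignoring biases and batch normalization; an illustrative calculation at Fig. 5's channel widths, not measured numbers):

```python
def conv_params(c_in, c_out, k):
    """Weight count of a k x k convolution, ignoring biases/BN."""
    return c_in * c_out * k * k

# Fig. 5 left: two 3x3 layers at 64 channels
plain_64d = 2 * conv_params(64, 64, 3)

# Fig. 5 right: bottleneck at 256-d -- 1x1 reduce, 3x3 at 64-d, 1x1 restore
bottleneck_256d = (conv_params(256, 64, 1)
                   + conv_params(64, 64, 3)
                   + conv_params(64, 256, 1))

print(plain_64d)        # 73728
print(bottleneck_256d)  # 69632
```

Both designs land near 70k parameters per block, which is why swapping in bottlenecks lets depth grow to 50/101/152 layers without a matching blow-up in complexity.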

Conclusion

- Even though the depth is significantly increased, the 152-layer ResNet (11.3 billion FLOPs) still has lower complexity than the VGG-16/19 nets (15.3/19.6 billion FLOPs).
- The deep plain nets suffer from increased depth, and exhibit higher training error when going deeper.
- The authors further explore n = 18, which leads to a 110-layer ResNet. In this case, they find that the initial learning rate of 0.1 is slightly too large to start converging.
- The authors' method shows no optimization difficulty, and this 10³-layer network is able to achieve training error <0.1% (Fig. 6, right).
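The warmup fix for the 110-layer net can be sketched as a simple schedule. This follows the paper's CIFAR-10 recipe as described in the text (0.01 until training error drops below 80%, then back to 0.1, with the usual /10 decays at 32k and 48k iterations); the function name and latch argument are illustrative:

```python
def learning_rate(iteration, train_error, warmed_up):
    """Return (lr, warmed_up). `warmed_up` latches once error < 0.80."""
    if not warmed_up and train_error >= 0.80:
        return 0.01, False          # warmup phase for the deeper net
    warmed_up = True
    if iteration < 32000:
        return 0.1, warmed_up       # base rate after warmup
    if iteration < 48000:
        return 0.01, warmed_up      # first /10 decay
    return 0.001, warmed_up         # second /10 decay

lr, w = learning_rate(100, 0.95, False)
print(lr)   # 0.01 -- still warming up
lr, w = learning_rate(500, 0.70, w)
print(lr)   # 0.1  -- back to the base rate
```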

Tables

- Table 1: Architectures for ImageNet. Building blocks are shown in brackets (see also Fig. 5), with the numbers of blocks stacked. Downsampling is performed by conv3_1, conv4_1, and conv5_1 with a stride of 2.
- Table 2: Top-1 error (%, 10-crop testing) on ImageNet validation. Here the ResNets have no extra parameters compared to their plain counterparts.
- Table 3: Error rates (%, 10-crop testing) on ImageNet validation. VGG-16 is based on our test. ResNet-50/101/152 use option B, which only uses projections for increasing dimensions.
- Table 4: Error rates (%) of single-model results on the ImageNet validation set (except † reported on the test set).
- Table 5: Error rates (%) of ensembles. The top-5 error is on the test set of ImageNet and reported by the test server.
- Table 6: Classification error on the CIFAR-10 test set. All methods use data augmentation. For ResNet-110, we run it 5 times and show "best (mean±std)" as in [42].
- Table 7: Object detection mAP (%) on the PASCAL VOC 2007/2012 test sets using baseline Faster R-CNN. See also the appendix for better results.
- Table 8: Object detection mAP (%) on the COCO validation set using baseline Faster R-CNN. See also the appendix for better results.

Related work

- Residual Representations. In image recognition, VLAD [18] is a representation that encodes by the residual vectors with respect to a dictionary, and Fisher Vector [30] can be formulated as a probabilistic version [18] of VLAD. Both of them are powerful shallow representations for image retrieval and classification [4, 47]. For vector quantization, encoding residual vectors [17] is shown to be more effective than encoding original vectors.

In low-level vision and computer graphics, for solving Partial Differential Equations (PDEs), the widely used Multigrid method [3] reformulates the system as subproblems at multiple scales, where each subproblem is responsible for the residual solution between a coarser and a finer scale. An alternative to Multigrid is hierarchical basis preconditioning [44, 45], which relies on variables that represent residual vectors between two scales. It has been shown [3, 44, 45] that these solvers converge much faster than standard solvers that are unaware of the residual nature of the solutions. These methods suggest that a good reformulation or preconditioning can simplify the optimization.
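The residual-representation idea behind VLAD can be shown in a toy form: each descriptor is assigned to its nearest codeword, and what gets accumulated per codeword is the residual vector, not the descriptor itself. A small sketch, not the reference VLAD implementation (names and data are illustrative):

```python
import numpy as np

def vlad_encode(descriptors, codebook):
    """descriptors: (n, d); codebook: (k, d). Returns (k, d) residual sums."""
    enc = np.zeros_like(codebook, dtype=float)
    for x in descriptors:
        i = np.argmin(np.linalg.norm(codebook - x, axis=1))  # nearest codeword
        enc[i] += x - codebook[i]                            # accumulate residual
    return enc

codebook = np.array([[0.0, 0.0], [10.0, 10.0]])
descs = np.array([[0.5, 0.0], [-0.5, 0.0], [10.0, 11.0]])
print(vlad_encode(descs, codebook))  # [[0. 0.] [0. 1.]]
```

The first two descriptors straddle codeword 0 and their residuals cancel, while the third contributes only its offset from codeword 1, illustrating how residuals relative to a dictionary can be more informative to encode than the raw vectors.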

Contributions

- Presents a residual learning framework to ease the training of networks that are substantially deeper than those used previously
- Provides comprehensive empirical evidence showing that these residual networks are easier to optimize, and can gain accuracy from considerably increased depth
- Evaluates residual nets with a depth of up to 152 layers—8× deeper than VGG nets but still having lower complexity
- Addresses the degradation problem by introducing a deep residual learning framework

Study subjects and analysis

The ResNets have no extra parameters compared to their plain counterparts. The authors draw three major observations from Table 2 and Fig. 4. First, the situation is reversed with residual learning: the 34-layer ResNet is better than the 18-layer ResNet (by 2.8%).

Reference

- Y. Bengio, P. Simard, and P. Frasconi. Learning long-term dependencies with gradient descent is difficult. IEEE Transactions on Neural Networks, 5(2):157–166, 1994.
- C. M. Bishop. Neural networks for pattern recognition. Oxford university press, 1995.
- W. L. Briggs, S. F. McCormick, et al. A Multigrid Tutorial. Siam, 2000.
- K. Chatfield, V. Lempitsky, A. Vedaldi, and A. Zisserman. The devil is in the details: an evaluation of recent feature encoding methods. In BMVC, 2011.
- M. Everingham, L. Van Gool, C. K. Williams, J. Winn, and A. Zisserman. The Pascal Visual Object Classes (VOC) Challenge. IJCV, pages 303–338, 2010.
- R. Girshick. Fast R-CNN. In ICCV, 2015.
- R. Girshick, J. Donahue, T. Darrell, and J. Malik. Rich feature hierarchies for accurate object detection and semantic segmentation. In CVPR, 2014.
- X. Glorot and Y. Bengio. Understanding the difficulty of training deep feedforward neural networks. In AISTATS, 2010.
- I. J. Goodfellow, D. Warde-Farley, M. Mirza, A. Courville, and Y. Bengio. Maxout networks. arXiv:1302.4389, 2013.
- K. He and J. Sun. Convolutional neural networks at constrained time cost. In CVPR, 2015.
- K. He, X. Zhang, S. Ren, and J. Sun. Spatial pyramid pooling in deep convolutional networks for visual recognition. In ECCV, 2014.
- K. He, X. Zhang, S. Ren, and J. Sun. Delving deep into rectifiers: Surpassing human-level performance on imagenet classification. In ICCV, 2015.
- G. E. Hinton, N. Srivastava, A. Krizhevsky, I. Sutskever, and R. R. Salakhutdinov. Improving neural networks by preventing coadaptation of feature detectors. arXiv:1207.0580, 2012.
- S. Hochreiter. Untersuchungen zu dynamischen neuronalen netzen. Diploma thesis, TU Munich, 1991.
- S. Hochreiter and J. Schmidhuber. Long short-term memory. Neural computation, 9(8):1735–1780, 1997.
- S. Ioffe and C. Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. In ICML, 2015.
- H. Jegou, M. Douze, and C. Schmid. Product quantization for nearest neighbor search. TPAMI, 33, 2011.
- H. Jegou, F. Perronnin, M. Douze, J. Sanchez, P. Perez, and C. Schmid. Aggregating local image descriptors into compact codes. TPAMI, 2012.
- Y. Jia, E. Shelhamer, J. Donahue, S. Karayev, J. Long, R. Girshick, S. Guadarrama, and T. Darrell. Caffe: Convolutional architecture for fast feature embedding. arXiv:1408.5093, 2014.
- A. Krizhevsky. Learning multiple layers of features from tiny images. Tech Report, 2009.
- A. Krizhevsky, I. Sutskever, and G. Hinton. Imagenet classification with deep convolutional neural networks. In NIPS, 2012.
- Y. LeCun, B. Boser, J. S. Denker, D. Henderson, R. E. Howard, W. Hubbard, and L. D. Jackel. Backpropagation applied to handwritten zip code recognition. Neural computation, 1989.
- Y. LeCun, L. Bottou, G. B. Orr, and K.-R. Muller. Efficient backprop. In Neural Networks: Tricks of the Trade, pages 9–50.
- C.-Y. Lee, S. Xie, P. Gallagher, Z. Zhang, and Z. Tu. Deeply-supervised nets. arXiv:1409.5185, 2014.
- M. Lin, Q. Chen, and S. Yan. Network in network. arXiv:1312.4400, 2013.
- T.-Y. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollar, and C. L. Zitnick. Microsoft COCO: Common objects in context. In ECCV. 2014.
- J. Long, E. Shelhamer, and T. Darrell. Fully convolutional networks for semantic segmentation. In CVPR, 2015.
- G. Montufar, R. Pascanu, K. Cho, and Y. Bengio. On the number of linear regions of deep neural networks. In NIPS, 2014.
- V. Nair and G. E. Hinton. Rectified linear units improve restricted boltzmann machines. In ICML, 2010.
- F. Perronnin and C. Dance. Fisher kernels on visual vocabularies for image categorization. In CVPR, 2007.
- T. Raiko, H. Valpola, and Y. LeCun. Deep learning made easier by linear transformations in perceptrons. In AISTATS, 2012.
- S. Ren, K. He, R. Girshick, and J. Sun. Faster R-CNN: Towards real-time object detection with region proposal networks. In NIPS, 2015.
- B. D. Ripley. Pattern recognition and neural networks. Cambridge university press, 1996.
- A. Romero, N. Ballas, S. E. Kahou, A. Chassang, C. Gatta, and Y. Bengio. Fitnets: Hints for thin deep nets. In ICLR, 2015.
- O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein, et al. Imagenet large scale visual recognition challenge. arXiv:1409.0575, 2014.
- A. M. Saxe, J. L. McClelland, and S. Ganguli. Exact solutions to the nonlinear dynamics of learning in deep linear neural networks. arXiv:1312.6120, 2013.
- N. N. Schraudolph. Accelerated gradient descent by factor-centering decomposition. Technical report, 1998.
- P. Sermanet, D. Eigen, X. Zhang, M. Mathieu, R. Fergus, and Y. LeCun. Overfeat: Integrated recognition, localization and detection using convolutional networks. In ICLR, 2014.
- K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition. In ICLR, 2015.
- R. K. Srivastava, K. Greff, and J. Schmidhuber. Highway networks. arXiv:1505.00387, 2015.
- R. K. Srivastava, K. Greff, and J. Schmidhuber. Training very deep networks. arXiv:1507.06228, 2015.
- C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and A. Rabinovich. Going deeper with convolutions. In CVPR, 2015.
- R. Szeliski. Fast surface interpolation using hierarchical basis functions. TPAMI, 1990.
- R. Szeliski. Locally adapted hierarchical basis preconditioning. In SIGGRAPH, 2006.
- T. Vatanen, T. Raiko, H. Valpola, and Y. LeCun. Pushing stochastic gradient towards second-order methods–backpropagation learning with transformations in nonlinearities. In Neural Information Processing, 2013.
- A. Vedaldi and B. Fulkerson. VLFeat: An open and portable library of computer vision algorithms, 2008.
- W. Venables and B. Ripley. Modern applied statistics with s-plus. 1999.
- M. D. Zeiler and R. Fergus. Visualizing and understanding convolutional networks. In ECCV, 2014.

Best Paper

Best Paper of CVPR, 2016
