Deep Residual Learning for Image Recognition

CVPR, 2016.

Cited by: 47,481
Keywords:
deep convolutional network, deep convolutional neural network, gradient descent, batch normalization, ImageNet dataset
Weibo:
Deep networks naturally integrate low/mid/high-level features and classifiers in an end-to-end multilayer fashion, and the “levels” of features can be enriched by the number of stacked layers

Abstract:

Deeper neural networks are more difficult to train. We present a residual learning framework to ease the training of networks that are substantially deeper than those used previously. We explicitly reformulate the layers as learning residual functions with reference to the layer inputs, instead of learning unreferenced functions. We provide comprehensive empirical evidence showing that these residual networks are easier to optimize, and can gain accuracy from considerably increased depth. On the ImageNet dataset we evaluate residual nets with a depth of up to 152 layers, 8× deeper than VGG nets but still having lower complexity.

Introduction
  • Deep convolutional neural networks [22, 21] have led to a series of breakthroughs for image classification [21, 49, 39].
  • The shortcut connections perform identity mapping, and their outputs are added to the outputs of the stacked layers (Fig. 2); a minimal code sketch of such a block follows this list.
  • On the ImageNet classification dataset [35], the authors obtain excellent results by extremely deep residual nets.
  • The papers of [38, 37, 31, 46] propose methods for centering layer responses, gradients, and propagated errors, implemented by shortcut connections.
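
A minimal sketch of such a residual block (two stacked 3×3 convolutions whose output is added to the identity shortcut before the final ReLU), assuming the input and output have the same dimensions; it is written in PyTorch purely for illustration, and the class name `BasicBlock` is ours, not from the authors' released code.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class BasicBlock(nn.Module):
    """Residual block: y = F(x) + x, with F = conv-BN-ReLU-conv-BN."""

    def __init__(self, channels: int):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, kernel_size=3, padding=1, bias=False)
        self.bn1 = nn.BatchNorm2d(channels)
        self.conv2 = nn.Conv2d(channels, channels, kernel_size=3, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(channels)

    def forward(self, x):
        out = F.relu(self.bn1(self.conv1(x)))
        out = self.bn2(self.conv2(out))
        out = out + x                 # identity shortcut: add the input back
        return F.relu(out)


# Usage: spatial size and channel count are unchanged by the block.
y = BasicBlock(64)(torch.randn(1, 64, 56, 56))   # shape stays (1, 64, 56, 56)
```
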
Highlights
  • Deep convolutional neural networks [22, 21] have led to a series of breakthroughs for image classification [21, 49, 39]
  • Deep networks naturally integrate low/mid/high-level features [49] and classifiers in an end-to-end multilayer fashion, and the “levels” of features can be enriched by the number of stacked layers
  • Recent evidence [40, 43] reveals that network depth is of crucial importance, and the leading results [40, 43, 12, 16] on the challenging ImageNet dataset [35] all exploit “very deep” [40] models, with a depth of sixteen [40] to thirty [16]
  • We address the degradation problem by introducing a deep residual learning framework
  • We show that: 1) Our extremely deep residual nets are easy to optimize, but the counterpart “plain” nets exhibit higher training error when the depth increases; 2) Our deep residual nets can enjoy accuracy gains from greatly increased depth, producing results substantially better than previous networks
  • We argue that this is because the zero-padded dimensions in option A have no residual learning (this parameter-free shortcut is sketched right after this list)
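
For reference, the parameter-free “option A” shortcut mentioned in the last bullet can be sketched as stride-2 subsampling of the identity with zero-padded extra channels, so the shortcut itself carries no learnable residual mapping. The exact padding arrangement below is our reading of the paper, written with PyTorch tensors purely for illustration.

```python
import torch
import torch.nn.functional as F

def option_a_shortcut(x: torch.Tensor, out_channels: int, stride: int = 2) -> torch.Tensor:
    """Parameter-free shortcut: subsample spatially, zero-pad the extra channels."""
    x = x[:, :, ::stride, ::stride]            # spatial downsampling, no parameters
    extra = out_channels - x.shape[1]
    # F.pad on an NCHW tensor: (W_left, W_right, H_top, H_bottom, C_front, C_back)
    return F.pad(x, (0, 0, 0, 0, 0, extra))

x = torch.randn(1, 64, 56, 56)
print(option_a_shortcut(x, 128).shape)         # torch.Size([1, 128, 28, 28])
```
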
Results
  • When a gated shortcut is “closed”, the layers in highway networks represent non-residual functions.
  • As the authors discussed in the introduction, if the added layers can be constructed as identity mappings, a deeper model should have training error no greater than its shallower counterpart.
  • The degradation problem suggests that the solvers might have difficulties in approximating identity mappings by multiple nonlinear layers.
  • With the residual learning reformulation, if identity mappings are optimal, the solvers may drive the weights of the multiple nonlinear layers toward zero to approach identity mappings.
  • The authors show by experiments (Fig. 7) that the learned residual functions in general have small responses, suggesting that identity mappings provide reasonable preconditioning.
  • Based on the above plain network, the authors insert shortcut connections (Fig. 3, right) which turn the network into its counterpart residual version.
  • The 34-layer plain net is still able to achieve competitive accuracy (Table 3), suggesting that the solver works to some extent.
  • The baseline architectures are the same as the above plain nets, except that a shortcut connection is added to each pair of 3×3 filters as in Fig. 3.
  • In the first comparison (Table 2 and Fig. 4 right), the authors use identity mapping for all shortcuts and zero-padding for increasing dimensions.
  • The authors note that the 18-layer plain/residual nets are comparably accurate (Table 2), but the 18-layer ResNet converges faster (Fig. 4 right vs left).
  • For each residual function F, the authors use a stack of 3 layers instead of 2 (Fig. 5).
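
The 3-layer residual function F mentioned in the last bullet is the “bottleneck” design (a 1×1 convolution that reduces the channel count, a 3×3 convolution, and a 1×1 convolution that restores it). The PyTorch sketch below assumes an identity shortcut and no downsampling; the class and argument names are ours.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class Bottleneck(nn.Module):
    """3-layer residual function: 1x1 reduce -> 3x3 -> 1x1 restore, plus identity."""

    def __init__(self, channels: int, reduced: int):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, reduced, kernel_size=1, bias=False)
        self.bn1 = nn.BatchNorm2d(reduced)
        self.conv2 = nn.Conv2d(reduced, reduced, kernel_size=3, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(reduced)
        self.conv3 = nn.Conv2d(reduced, channels, kernel_size=1, bias=False)
        self.bn3 = nn.BatchNorm2d(channels)

    def forward(self, x):
        out = F.relu(self.bn1(self.conv1(x)))
        out = F.relu(self.bn2(self.conv2(out)))
        out = self.bn3(self.conv3(out))
        return F.relu(out + x)                 # identity shortcut keeps the block cheap


# e.g. a 256-d block with a 64-d bottleneck, as in the deeper ImageNet ResNets
y = Bottleneck(256, 64)(torch.randn(1, 256, 56, 56))
```
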
Conclusion
  • Even though the depth is significantly increased, the 152-layer ResNet (11.3 billion FLOPs) still has lower complexity than VGG-16/19 nets (15.3/19.6 billion FLOPs).
  • The deep plain nets suffer from increased depth, and exhibit higher training error when going deeper.
  • The authors further explore n = 18, which leads to a 110-layer ResNet. In this case, they find that the initial learning rate of 0.1 is slightly too large to start converging, so training is warmed up with a smaller rate before returning to 0.1 (a sketch of such a schedule follows this list).
  • The authors also explore an aggressively deep model of over 1000 layers; their method shows no optimization difficulty, and this 10³-layer network is able to achieve training error <0.1% (Fig. 6, right).
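
As a rough sketch of the warm-up mentioned in the bullets above: training starts at a smaller rate until optimization gets going, then returns to 0.1 and follows the usual step decay. The concrete numbers below (0.01 warm-up, decay at 32k and 48k iterations) reflect our reading of the paper's CIFAR-10 setup and should be treated as assumptions here.

```python
def learning_rate(iteration: int, warmed_up: bool) -> float:
    """Warm-up then step-decay schedule for the deeper CIFAR-10 ResNets (sketch)."""
    if not warmed_up:          # e.g. until the training error drops enough to proceed
        return 0.01            # smaller rate so the 110-layer net starts converging
    if iteration < 32_000:
        return 0.1
    if iteration < 48_000:
        return 0.01
    return 0.001
```
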
Tables
  • Table1: Architectures for ImageNet. Building blocks are shown in brackets (see also Fig. 5), with the numbers of blocks stacked. Downsampling is performed by conv3_1, conv4_1, and conv5_1 with a stride of 2 (the per-depth block counts are summarized in the sketch after this list)
  • Table2: Top-1 error (%, 10-crop testing) on ImageNet validation. Here the ResNets have no extra parameter compared to their plain counterparts
  • Table3: Error rates (%, 10-crop testing) on ImageNet validation. VGG-16 is based on our test. ResNet-50/101/152 are of option B that only uses projections for increasing dimensions
  • Table4: Error rates (%) of single-model results on the ImageNet validation set (except † reported on the test set)
  • Table5: Error rates (%) of ensembles. The top-5 error is on the test set of ImageNet and reported by the test server
  • Table6: Classification error on the CIFAR-10 test set. All methods are with data augmentation. For ResNet-110, we run it 5 times and show “best (mean±std)” as in [42]
  • Table7: Object detection mAP (%) on the PASCAL VOC 2007/2012 test sets using baseline Faster R-CNN. See also appendix for better results
  • Table8: Object detection mAP (%) on the COCO validation set using baseline Faster R-CNN. See also appendix for better results
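
As a compact reading of Table 1, the per-stage block counts for the ImageNet architectures can be written as a small configuration table. The Python layout below is ours (`IMAGENET_CONFIGS` is an illustrative name); the counts themselves are the ones reported in the paper, with downsampling handled by the first block of conv3_x, conv4_x, and conv5_x.

```python
# Stage block counts for conv2_x .. conv5_x, keyed by total depth.
IMAGENET_CONFIGS = {
    18:  ("basic",      (2, 2, 2, 2)),
    34:  ("basic",      (3, 4, 6, 3)),
    50:  ("bottleneck", (3, 4, 6, 3)),
    101: ("bottleneck", (3, 4, 23, 3)),
    152: ("bottleneck", (3, 8, 36, 3)),
}
```
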
Related work
  • Residual Representations. In image recognition, VLAD [18] is a representation that encodes by the residual vectors with respect to a dictionary, and Fisher Vector [30] can be formulated as a probabilistic version [18] of VLAD. Both of them are powerful shallow representations for image retrieval and classification [4, 47]. For vector quantization, encoding residual vectors [17] is shown to be more effective than encoding original vectors; a small encoding example follows this passage.

    In low-level vision and computer graphics, for solving Partial Differential Equations (PDEs), the widely used Multigrid method [3] reformulates the system as subproblems at multiple scales, where each subproblem is responsible for the residual solution between a coarser and a finer scale. An alternative to Multigrid is hierarchical basis preconditioning [44, 45], which relies on variables that represent residual vectors between two scales. It has been shown [3, 44, 45] that these solvers converge much faster than standard solvers that are unaware of the residual nature of the solutions. These methods suggest that a good reformulation or preconditioning can simplify the optimization.
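
To make the residual-representation idea concrete, here is a small NumPy sketch (ours, not from the cited works) of encoding a vector by its residual with respect to the nearest codebook entry, as in residual vector quantization; the residuals are typically small and easier to encode than the raw vectors.

```python
import numpy as np

def encode_residual(x: np.ndarray, codebook: np.ndarray):
    """Return (index of nearest codeword, residual vector x - codeword)."""
    distances = np.linalg.norm(codebook - x, axis=1)
    k = int(np.argmin(distances))
    return k, x - codebook[k]

rng = np.random.default_rng(0)
codebook = rng.normal(size=(16, 8))       # 16 codewords of dimension 8
x = rng.normal(size=8)
k, r = encode_residual(x, codebook)       # r is what gets encoded/stored
```
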
Funding
  • Presents a residual learning framework to ease the training of networks that are substantially deeper than those used previously
  • Provides comprehensive empirical evidence showing that these residual networks are easier to optimize, and can gain accuracy from considerably increased depth
  • Evaluates residual nets with a depth of up to 152 layers—8× deeper than VGG nets but still having lower complexity
  • Addresses the degradation problem by introducing a deep residual learning framework
Study subjects and analysis
Major observations: 3
The ResNets have no extra parameters compared to their plain counterparts. Three major observations follow from Table 2 and Fig. 4. First, the situation is reversed with residual learning: the 34-layer ResNet is better than the 18-layer ResNet (by 2.8%).

Reference
  • [1] Y. Bengio, P. Simard, and P. Frasconi. Learning long-term dependencies with gradient descent is difficult. IEEE Transactions on Neural Networks, 5(2):157–166, 1994.
  • [2] C. M. Bishop. Neural networks for pattern recognition. Oxford University Press, 1995.
  • [3] W. L. Briggs, S. F. McCormick, et al. A Multigrid Tutorial. SIAM, 2000.
  • [4] K. Chatfield, V. Lempitsky, A. Vedaldi, and A. Zisserman. The devil is in the details: an evaluation of recent feature encoding methods. In BMVC, 2011.
  • [5] M. Everingham, L. Van Gool, C. K. Williams, J. Winn, and A. Zisserman. The Pascal Visual Object Classes (VOC) Challenge. IJCV, pages 303–338, 2010.
  • [6] R. Girshick. Fast R-CNN. In ICCV, 2015.
  • [7] R. Girshick, J. Donahue, T. Darrell, and J. Malik. Rich feature hierarchies for accurate object detection and semantic segmentation. In CVPR, 2014.
  • [8] X. Glorot and Y. Bengio. Understanding the difficulty of training deep feedforward neural networks. In AISTATS, 2010.
  • [9] I. J. Goodfellow, D. Warde-Farley, M. Mirza, A. Courville, and Y. Bengio. Maxout networks. arXiv:1302.4389, 2013.
  • [10] K. He and J. Sun. Convolutional neural networks at constrained time cost. In CVPR, 2015.
  • [11] K. He, X. Zhang, S. Ren, and J. Sun. Spatial pyramid pooling in deep convolutional networks for visual recognition. In ECCV, 2014.
  • [12] K. He, X. Zhang, S. Ren, and J. Sun. Delving deep into rectifiers: Surpassing human-level performance on ImageNet classification. In ICCV, 2015.
  • [13] G. E. Hinton, N. Srivastava, A. Krizhevsky, I. Sutskever, and R. R. Salakhutdinov. Improving neural networks by preventing co-adaptation of feature detectors. arXiv:1207.0580, 2012.
  • [14] S. Hochreiter. Untersuchungen zu dynamischen neuronalen Netzen. Diploma thesis, TU Munich, 1991.
  • [15] S. Hochreiter and J. Schmidhuber. Long short-term memory. Neural Computation, 9(8):1735–1780, 1997.
  • [16] S. Ioffe and C. Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. In ICML, 2015.
  • [17] H. Jégou, M. Douze, and C. Schmid. Product quantization for nearest neighbor search. TPAMI, 33, 2011.
  • [18] H. Jégou, F. Perronnin, M. Douze, J. Sánchez, P. Pérez, and C. Schmid. Aggregating local image descriptors into compact codes. TPAMI, 2012.
  • [19] Y. Jia, E. Shelhamer, J. Donahue, S. Karayev, J. Long, R. Girshick, S. Guadarrama, and T. Darrell. Caffe: Convolutional architecture for fast feature embedding. arXiv:1408.5093, 2014.
  • [20] A. Krizhevsky. Learning multiple layers of features from tiny images. Tech report, 2009.
  • [21] A. Krizhevsky, I. Sutskever, and G. Hinton. ImageNet classification with deep convolutional neural networks. In NIPS, 2012.
  • [22] Y. LeCun, B. Boser, J. S. Denker, D. Henderson, R. E. Howard, W. Hubbard, and L. D. Jackel. Backpropagation applied to handwritten zip code recognition. Neural Computation, 1989.
  • [23] Y. LeCun, L. Bottou, G. B. Orr, and K.-R. Müller. Efficient backprop. In Neural Networks: Tricks of the Trade, pages 9–50. Springer, 1998.
  • [24] C.-Y. Lee, S. Xie, P. Gallagher, Z. Zhang, and Z. Tu. Deeply-supervised nets. arXiv:1409.5185, 2014.
  • [25] M. Lin, Q. Chen, and S. Yan. Network in network. arXiv:1312.4400, 2013.
  • [26] T.-Y. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár, and C. L. Zitnick. Microsoft COCO: Common objects in context. In ECCV, 2014.
  • [27] J. Long, E. Shelhamer, and T. Darrell. Fully convolutional networks for semantic segmentation. In CVPR, 2015.
  • [28] G. Montúfar, R. Pascanu, K. Cho, and Y. Bengio. On the number of linear regions of deep neural networks. In NIPS, 2014.
  • [29] V. Nair and G. E. Hinton. Rectified linear units improve restricted Boltzmann machines. In ICML, 2010.
  • [30] F. Perronnin and C. Dance. Fisher kernels on visual vocabularies for image categorization. In CVPR, 2007.
  • [31] T. Raiko, H. Valpola, and Y. LeCun. Deep learning made easier by linear transformations in perceptrons. In AISTATS, 2012.
  • [32] S. Ren, K. He, R. Girshick, and J. Sun. Faster R-CNN: Towards real-time object detection with region proposal networks. In NIPS, 2015.
  • [33] B. D. Ripley. Pattern recognition and neural networks. Cambridge University Press, 1996.
  • [34] A. Romero, N. Ballas, S. E. Kahou, A. Chassang, C. Gatta, and Y. Bengio. FitNets: Hints for thin deep nets. In ICLR, 2015.
  • [35] O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein, et al. ImageNet large scale visual recognition challenge. arXiv:1409.0575, 2014.
  • [36] A. M. Saxe, J. L. McClelland, and S. Ganguli. Exact solutions to the nonlinear dynamics of learning in deep linear neural networks. arXiv:1312.6120, 2013.
  • [37] N. N. Schraudolph. Accelerated gradient descent by factor-centering decomposition. Technical report, 1998.
  • [38] N. N. Schraudolph. Centering neural network gradient factors. In Neural Networks: Tricks of the Trade. Springer, 1998.
  • [39] P. Sermanet, D. Eigen, X. Zhang, M. Mathieu, R. Fergus, and Y. LeCun. OverFeat: Integrated recognition, localization and detection using convolutional networks. In ICLR, 2014.
  • [40] K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition. In ICLR, 2015.
  • [41] R. K. Srivastava, K. Greff, and J. Schmidhuber. Highway networks. arXiv:1505.00387, 2015.
  • [42] R. K. Srivastava, K. Greff, and J. Schmidhuber. Training very deep networks. arXiv:1507.06228, 2015.
  • [43] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and A. Rabinovich. Going deeper with convolutions. In CVPR, 2015.
  • [44] R. Szeliski. Fast surface interpolation using hierarchical basis functions. TPAMI, 1990.
  • [45] R. Szeliski. Locally adapted hierarchical basis preconditioning. In SIGGRAPH, 2006.
  • [46] T. Vatanen, T. Raiko, H. Valpola, and Y. LeCun. Pushing stochastic gradient towards second-order methods: backpropagation learning with transformations in nonlinearities. In Neural Information Processing, 2013.
  • [47] A. Vedaldi and B. Fulkerson. VLFeat: An open and portable library of computer vision algorithms, 2008.
  • [48] W. Venables and B. Ripley. Modern applied statistics with S-PLUS. 1999.
  • [49] M. D. Zeiler and R. Fergus. Visualizing and understanding convolutional neural networks. In ECCV, 2014.
Best Paper of CVPR, 2016