Delving Deep into Rectifiers: Surpassing Human-Level Performance on ImageNet Classification

ICCV, 2015.

Keywords:
linear unit, rectified linear unit, human-level performance, rectifier nonlinearities, neural network (11+ more)

Abstract:

Rectified activation units (rectifiers) are essential for state-of-the-art neural networks. In this work, we study rectifier neural networks for image classification from two aspects. First, we propose a Parametric Rectified Linear Unit (PReLU) that generalizes the traditional rectified unit. PReLU improves model fitting with nearly zero ...
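For concreteness, the PReLU described in the abstract is f(y_i) = max(0, y_i) + a_i * min(0, y_i), where a_i is a learned negative-slope coefficient, either one per channel (channel-wise) or one shared by all channels of a layer (channel-shared); a_i = 0 recovers ReLU and a fixed small a_i (e.g. 0.25) recovers Leaky ReLU. Below is a minimal NumPy sketch of the forward pass; the function name and the (N, C, H, W) tensor layout are our own illustrative choices, not the authors' code.

```python
import numpy as np

def prelu_forward(y, a):
    """Parametric ReLU: f(y) = max(0, y) + a * min(0, y).

    y : pre-activations, e.g. shaped (N, C, H, W).
    a : learned negative-slope coefficient; a scalar for the
        channel-shared variant, or an array of shape (C, 1, 1)
        for the channel-wise variant (broadcast over H and W).
    """
    return np.maximum(0.0, y) + a * np.minimum(0.0, y)

# Example: channel-wise PReLU on a random batch.
rng = np.random.default_rng(0)
y = rng.standard_normal((2, 3, 4, 4))   # (N, C, H, W)
a = np.full((3, 1, 1), 0.25)            # one coefficient per channel
out = prelu_forward(y, a)
```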

Introduction
  • Convolutional neural networks (CNNs) [19, 18] have demonstrated recognition accuracy better than or comparable to humans in several visual recognition tasks, including recognizing traffic signs [3], faces [34, 32], and handwritten digits [3, 36].
  • Effective regularization techniques [13, 30, 10, 36], aggressive data augmentation [18, 14, 29, 33], and large-scale data [4, 26] have also contributed to these advances.
  • Among these advances, the rectifier neuron [24, 9, 23, 38], e.g., Rectified Linear Unit (ReLU), is one of several keys to the recent success of deep networks [18].
  • Despite the prevalence of rectifier networks, recent improvements of models [37, 28, 12, 29, 33] and theoretical guidelines for training them [8, 27] have rarely focused on the properties of the rectifiers.
Highlights
  • Convolutional neural networks (CNNs) [19, 18] have demonstrated recognition accuracy better than or comparable to humans in several visual recognition tasks, including recognizing traffic signs [3], faces [34, 32], and handwritten digits [3, 36]
  • We present a result that surpasses the human-level performance reported by [26] on a more generic and challenging recognition task - the classification task in the 1000-class ImageNet dataset [26]
  • Neural networks are becoming more capable of fitting training data, because of increased complexity, new nonlinear activations [24, 23, 38, 22, 31, 10], and sophisticated layer designs [33, 12]
  • We investigate neural networks from two aspects driven by the rectifier properties
  • We show the result of LReLU with a = 0.25 in Table 2, which is no better than ReLU (see the coefficient-update sketch after this list)
  • We investigate how Parametric Rectified Linear Unit may affect training via computing the Fisher Information Matrix (FIM)
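The learned coefficient is updated by backpropagation alongside the weights: since df(y)/da is 0 for y > 0 and y for y <= 0 (i.e. min(0, y)), the gradient for a coefficient is the sum of the upstream gradients times min(0, y) over all positions that share it. A hedged NumPy sketch for the channel-wise case follows; names and shapes are illustrative, not the authors' implementation.

```python
import numpy as np

def prelu_backward(y, a, grad_out):
    """Backward pass of f(y) = max(0, y) + a * min(0, y) for a
    channel-wise coefficient `a` of shape (C, 1, 1) and inputs of
    shape (N, C, H, W).

    Returns:
      grad_y : gradient w.r.t. the pre-activation y
               (1 where y > 0, a where y <= 0).
      grad_a : gradient w.r.t. a; df/da = min(0, y), summed over the
               batch and spatial positions sharing each channel's a.
    """
    grad_y = grad_out * np.where(y > 0, 1.0, a)
    grad_a = np.sum(grad_out * np.minimum(0.0, y),
                    axis=(0, 2, 3)).reshape(a.shape)
    return grad_y, grad_a
```

The paper also notes that no weight decay (l2 regularization) is applied to a when updating it, since that would bias the coefficient toward zero, i.e. toward plain ReLU; keeping a fixed at 0.25 with no update is exactly the LReLU baseline compared above.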
Methods
  • Experiments on ImageNet

    The authors perform the experiments on the 1000-class ImageNet 2012 dataset [26] which contains about 1.2 million training images, 50,000 validation images, and 100,000 test images.
  • In Table 4, the authors compare ReLU and PReLU on the large model A.
  • The authors use the channel-wise version of PReLU.
  • For fair comparisons, both the ReLU and PReLU models are trained using the same total number of epochs, and the learning rates are switched after running the same number of epochs (a schedule sketch follows this list).
  • Figure 4 shows the train/val error during training.
  • PReLU has lower train and val error than ReLU throughout the training procedure.
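As a sketch of the fair-comparison protocol above (same epoch budget and the learning rate dropped at the same epochs for both models), one way to share a step schedule between the ReLU and PReLU runs is shown below; the epoch counts and rates are illustrative placeholders, not the authors' exact settings.

```python
def step_lr(epoch, base_lr=1e-2, drop_epochs=(60, 80), gamma=0.1):
    """Step schedule: multiply the base learning rate by `gamma`
    each time `epoch` reaches one of `drop_epochs`."""
    lr = base_lr
    for e in drop_epochs:
        if epoch >= e:
            lr *= gamma
    return lr

# Both models get an identical budget and schedule, so any gap in
# train/val error is attributable to the activation, not the optimizer.
for model_name in ("model_A_relu", "model_A_prelu"):   # hypothetical names
    for epoch in range(90):                            # same total epochs
        lr = step_lr(epoch)
        # ... run one training epoch of `model_name` at rate `lr` ...
```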
Results
  • Based on the learnable activation and advanced initialization, the authors achieve 4.94% top-5 test error on the ImageNet 2012 classification dataset (an initialization sketch follows this list).
  • The authors' result is 1.7% (absolute) better than the ILSVRC 2014 winner (GoogLeNet, 6.66% [33]), which represents a 26% relative improvement.
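The "advanced initialization" is the paper's variance-preserving scheme for rectifier networks: for a conv layer with fan-in n = k^2 * c (kernel size k, input channels c), drawing weights from a zero-mean Gaussian with std sqrt(2 / n) keeps the forward signal variance stable under ReLU, and sqrt(2 / ((1 + a^2) * n)) generalizes this to PReLU with coefficient a. A minimal NumPy sketch (argument names and the (c_out, c_in, k, k) layout are ours):

```python
import numpy as np

def rectifier_init(k, c_in, c_out, a=0.0, rng=None):
    """Draw a conv weight tensor of shape (c_out, c_in, k, k) from
    N(0, sqrt(2 / ((1 + a**2) * n))), with n = k*k*c_in the fan-in.
    a = 0 gives the ReLU case; a > 0 covers PReLU with slope a."""
    rng = rng or np.random.default_rng()
    n = k * k * c_in
    std = np.sqrt(2.0 / ((1.0 + a * a) * n))
    return rng.normal(0.0, std, size=(c_out, c_in, k, k))

w = rectifier_init(k=3, c_in=64, c_out=128, a=0.25)
```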
Conclusion
  • Discussion on Rectifiers: the analysis in Sec. 2.1 and 2.2 involves the "rectified" units, which are asymmetric activation functions, unlike many activations that are symmetric.
  • This asymmetry leads to some fundamental differences.
  • The conclusions involving Eqn. (6) and Eqn. (9) are heavily biased by the fact that E[f] is greater than zero in the case of ReLU (a numeric check follows this list).
  • The asymmetric behavior requires algorithmic changes that take this effect into account.
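The E[f] > 0 point is easy to verify: a rectifier discards the negative half of a zero-mean input, so its output mean is strictly positive; for y ~ N(0, 1), E[max(0, y)] = 1/sqrt(2*pi) ≈ 0.399. A quick numeric check (our sketch, not from the paper):

```python
import numpy as np

rng = np.random.default_rng(0)
y = rng.standard_normal(1_000_000)        # zero-mean, unit-variance input
relu_mean = np.maximum(0.0, y).mean()     # Monte Carlo estimate of E[f]
closed_form = 1.0 / np.sqrt(2.0 * np.pi)  # exact E[max(0, y)] for N(0, 1)
print(relu_mean, closed_form)             # both are about 0.399 > 0
```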
Tables
  • Table 1: A small but deep 14-layer model [11]. The filter size and filter number of each layer are listed. The notation /s indicates that a stride of s is used. The learned coefficients of PReLU are also shown; for the channel-wise case, the average of {a_i} over the channels is shown for each layer.
  • Table 2: Comparisons between ReLU, LReLU, and PReLU on the small model. The error rates are for ImageNet 2012 using 10-view testing. The images are resized so that the shorter side is 256, during both training and testing. Each view is 224×224. All models are trained for 75 epochs (a sketch of 10-view cropping follows this list).
  • Table 3: Architectures of the large models. Here "/2" denotes a stride of 2. The "spp" layer [12] produces a 4-level {7, 3, 2, 1} pyramid. The complexity (comp.) is the number of operations, in units of 10^10.
  • Table 4: Comparisons between ReLU/PReLU on model A in ImageNet 2012 using dense testing
  • Table 5: Single-model 10-view results for the ImageNet 2012 val set. †: Based on our tests
  • Table 6: Single-model results for the ImageNet 2012 val set. †: Evaluated from the test set
  • Table 7: Multi-model results for the ImageNet 2012 test set
  • Table 8: Object detection mAP on PASCAL VOC 2007 using Fast R-CNN [6] on different pre-trained nets
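The "10-view testing" in Tables 2 and 5 refers to the standard protocol of averaging predictions over the center and four corner crops of the resized image plus their horizontal flips. A minimal sketch of producing the ten 224×224 views (the function name is ours; the image is assumed already resized so that its shorter side is 256):

```python
import numpy as np

def ten_crop(img, crop=224):
    """Return the ten standard test views of `img` (H, W, C):
    the center and four corner crops of size `crop`, each also
    flipped horizontally."""
    h, w = img.shape[:2]
    offsets = [((h - crop) // 2, (w - crop) // 2),   # center
               (0, 0), (0, w - crop),                # top-left, top-right
               (h - crop, 0), (h - crop, w - crop)]  # bottom-left, bottom-right
    views = []
    for top, left in offsets:
        patch = img[top:top + crop, left:left + crop]
        views.append(patch)
        views.append(patch[:, ::-1])                 # horizontal flip
    return np.stack(views)                           # (10, crop, crop, C)
```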
Funding
  • Studies rectifier neural networks for image classification from two aspects
  • Achieves 4.94% top-5 test error on the ImageNet 2012 classification dataset
  • Presents a result that surpasses the human-level performance reported by [26] on a more generic and challenging recognition task - the classification task in the 1000-class ImageNet dataset
  • Investigates neural networks from two aspects driven by the rectifier properties
  • Shows that replacing the parameter-free ReLU by a learned activation unit improves classification accuracy
References
  • [1] F. Agostinelli, M. Hoffman, P. Sadowski, and P. Baldi. Learning activation functions to improve deep neural networks. arXiv:1412.6830, 2014.
  • [2] K. Chatfield, K. Simonyan, A. Vedaldi, and A. Zisserman. Return of the devil in the details: Delving deep into convolutional nets. In BMVC, 2014.
  • [3] D. Ciresan, U. Meier, and J. Schmidhuber. Multi-column deep neural networks for image classification. In CVPR, 2012.
  • [4] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei. ImageNet: A large-scale hierarchical image database. In CVPR, 2009.
  • [5] M. Everingham, L. Van Gool, C. K. Williams, J. Winn, and A. Zisserman. The PASCAL Visual Object Classes (VOC) Challenge. IJCV, pages 303–338, 2010.
  • [6] R. Girshick. Fast R-CNN. arXiv:1504.08083, 2015.
  • [7] R. Girshick, J. Donahue, T. Darrell, and J. Malik. Rich feature hierarchies for accurate object detection and semantic segmentation. In CVPR, 2014.
  • [8] X. Glorot and Y. Bengio. Understanding the difficulty of training deep feedforward neural networks. In International Conference on Artificial Intelligence and Statistics, 2010.
  • [9] X. Glorot, A. Bordes, and Y. Bengio. Deep sparse rectifier networks. In Proceedings of the 14th International Conference on Artificial Intelligence and Statistics, pages 315–323, 2011.
  • [10] I. J. Goodfellow, D. Warde-Farley, M. Mirza, A. Courville, and Y. Bengio. Maxout networks. arXiv:1302.4389, 2013.
  • [11] K. He and J. Sun. Convolutional neural networks at constrained time cost. arXiv:1412.1710, 2014.
  • [12] K. He, X. Zhang, S. Ren, and J. Sun. Spatial pyramid pooling in deep convolutional networks for visual recognition. arXiv:1406.4729v2, 2014.
  • [13] G. E. Hinton, N. Srivastava, A. Krizhevsky, I. Sutskever, and R. R. Salakhutdinov. Improving neural networks by preventing co-adaptation of feature detectors. arXiv:1207.0580, 2012.
  • [14] A. G. Howard. Some improvements on deep convolutional neural network based image classification. arXiv:1312.5402, 2013.
  • [15] S. Ioffe and C. Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. arXiv:1502.03167, 2015.
  • [16] Y. Jia, E. Shelhamer, J. Donahue, S. Karayev, J. Long, R. Girshick, S. Guadarrama, and T. Darrell. Caffe: Convolutional architecture for fast feature embedding. arXiv:1408.5093, 2014.
  • [17] A. Krizhevsky. One weird trick for parallelizing convolutional neural networks. arXiv:1404.5997, 2014.
  • [18] A. Krizhevsky, I. Sutskever, and G. Hinton. ImageNet classification with deep convolutional neural networks. In NIPS, 2012.
  • [19] Y. LeCun, B. Boser, J. S. Denker, D. Henderson, R. E. Howard, W. Hubbard, and L. D. Jackel. Backpropagation applied to handwritten zip code recognition. Neural Computation, 1989.
  • [20] Y. LeCun, L. Bottou, G. B. Orr, and K.-R. Müller. Efficient backprop. In Neural Networks: Tricks of the Trade, pages 9–50.
  • [21] C.-Y. Lee, S. Xie, P. Gallagher, Z. Zhang, and Z. Tu. Deeply-supervised nets. arXiv:1409.5185, 2014.
  • [22] M. Lin, Q. Chen, and S. Yan. Network in network. arXiv:1312.4400, 2013.
  • [23] A. L. Maas, A. Y. Hannun, and A. Y. Ng. Rectifier nonlinearities improve neural network acoustic models. In ICML, 2013.
  • [24] V. Nair and G. E. Hinton. Rectified linear units improve restricted Boltzmann machines. In ICML, pages 807–814, 2010.
  • [25] T. Raiko, H. Valpola, and Y. LeCun. Deep learning made easier by linear transformations in perceptrons. In International Conference on Artificial Intelligence and Statistics, 2012.
  • [26] O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein, et al. ImageNet large scale visual recognition challenge. arXiv:1409.0575, 2014.
  • [27] A. M. Saxe, J. L. McClelland, and S. Ganguli. Exact solutions to the nonlinear dynamics of learning in deep linear neural networks. arXiv:1312.6120, 2013.
  • [28] P. Sermanet, D. Eigen, X. Zhang, M. Mathieu, R. Fergus, and Y. LeCun. OverFeat: Integrated recognition, localization and detection using convolutional networks. 2014.
  • [29] K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition. arXiv:1409.1556, 2014.
  • [30] N. Srivastava, G. Hinton, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov. Dropout: A simple way to prevent neural networks from overfitting. The Journal of Machine Learning Research, pages 1929–1958, 2014.
  • [31] R. K. Srivastava, J. Masci, S. Kazerounian, F. Gomez, and J. Schmidhuber. Compete to compute. In NIPS, pages 2310–2318, 2013.
  • [32] Y. Sun, Y. Chen, X. Wang, and X. Tang. Deep learning face representation by joint identification-verification. In NIPS, 2014.
  • [33] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and A. Rabinovich. Going deeper with convolutions. arXiv:1409.4842, 2014.
  • [34] Y. Taigman, M. Yang, M. Ranzato, and L. Wolf. DeepFace: Closing the gap to human-level performance in face verification. In CVPR, 2014.
  • [35] T. Vatanen, T. Raiko, H. Valpola, and Y. LeCun. Pushing stochastic gradient towards second-order methods – backpropagation learning with transformations in nonlinearities. In Neural Information Processing, pages 442–449.
  • [36] L. Wan, M. Zeiler, S. Zhang, Y. L. Cun, and R. Fergus. Regularization of neural networks using dropconnect. In ICML, pages 1058–1066, 2013.
  • [37] M. D. Zeiler and R. Fergus. Visualizing and understanding convolutional neural networks. In ECCV, 2014.
  • [38] M. D. Zeiler, M. Ranzato, R. Monga, M. Mao, K. Yang, Q. V. Le, P. Nguyen, A. Senior, V. Vanhoucke, J. Dean, and G. E. Hinton. On rectified linear units for speech processing. In ICASSP, 2013.