# Delving Deep into Rectifiers: Surpassing Human-Level Performance on ImageNet Classification

ICCV, 2015.

Abstract

Rectified activation units (rectifiers) are essential for state-of-the-art neural networks. In this work, we study rectifier neural networks for image classification from two aspects. First, we propose a Parametric Rectified Linear Unit (PReLU) that generalizes the traditional rectified unit. PReLU improves model fitting with nearly zero extra computational cost and little overfitting risk.

Introduction

- Convolutional neural networks (CNNs) [19, 18] have demonstrated recognition accuracy better than or comparable to humans in several visual recognition tasks, including recognizing traffic signs [3], faces [34, 32], and handwritten digits [3, 36].
- This progress is driven by techniques [13, 30, 10, 36], aggressive data augmentation [18, 14, 29, 33], and large-scale data [4, 26]
- Among these advances, the rectifier neuron [24, 9, 23, 38], e.g., Rectified Linear Unit (ReLU), is one of several keys to the recent success of deep networks [18].
- Despite the prevalence of rectifier networks, recent improvements of models [37, 28, 12, 29, 33] and theoretical guidelines for training them [8, 27] have rarely focused on the properties of the rectifiers

Highlights

- Convolutional neural networks (CNNs) [19, 18] have demonstrated recognition accuracy better than or comparable to humans in several visual recognition tasks, including recognizing traffic signs [3], faces [34, 32], and handwritten digits [3, 36]
- We present a result that surpasses the human-level performance reported by [26] on a more generic and challenging recognition task - the classification task in the 1000-class ImageNet dataset [26]
- Neural networks are becoming more capable of fitting training data, because of increased complexity, new nonlinear activations [24, 23, 38, 22, 31, 10], and sophisticated layer designs [33, 12]
- We investigate neural networks from two aspects driven by the rectifier properties
- We show the result of LReLU with a = 0.25 in Table 2, which is no better than ReLU
- We investigate how PReLU may affect training via computing the Fisher Information Matrix (FIM)
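For concreteness, the activation compared above can be written as f(y) = y if y > 0, else a·y, where a is fixed at 0 for ReLU, fixed at a small constant (e.g. 0.25) for LReLU, and learned for PReLU. A minimal numpy sketch (not the paper's released code; the function name is mine):

```python
import numpy as np

def prelu(y, a):
    """Parametric ReLU: identity for positive inputs, slope a for negative.

    a = 0 reduces to ReLU; a fixed at 0.25 matches the LReLU variant
    compared in Table 2; in PReLU, a is learned by backpropagation.
    """
    return np.where(y > 0, y, a * y)

x = np.array([-2.0, -0.5, 0.0, 1.0, 3.0])
relu_out = prelu(x, 0.0)    # ReLU behavior
lrelu_out = prelu(x, 0.25)  # LReLU / initial PReLU behavior
```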

Methods

**Experiments on ImageNet**

The authors perform the experiments on the 1000-class ImageNet 2012 dataset [26], which contains about 1.2 million training images, 50,000 validation images, and 100,000 test images.

- In Table 4, the authors compare ReLU and PReLU on the large model A.
- The authors use the channel-wise version of PReLU.
- For fair comparisons, both ReLU/PReLU models are trained using the same total number of epochs, and the learning rates are switched after running the same number of epochs.
- Figure 4 shows the train/val error during training.
- PReLU has lower train error and val error than ReLU throughout the training procedure
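The channel-wise PReLU mentioned above learns one coefficient per channel, shared across all spatial positions of that channel. A sketch of the forward pass and the gradient with respect to the per-channel slope (shapes and function names are my assumptions, not the paper's implementation):

```python
import numpy as np

def prelu_channelwise(x, a):
    """x: activations of shape (N, C, H, W); a: learned slope per channel, shape (C,)."""
    return np.where(x > 0, x, a.reshape(1, -1, 1, 1) * x)

def prelu_grad_a(x, grad_out):
    """Gradient of the loss w.r.t. each channel's slope a_c:
    grad_out * x contributes wherever x <= 0, summed over batch and spatial dims."""
    return np.where(x <= 0, grad_out * x, 0.0).sum(axis=(0, 2, 3))

x = np.array([[[[-1.0, 2.0]], [[-3.0, 4.0]]]])  # shape (1, 2, 1, 2)
a = np.array([0.25, 0.5])
y = prelu_channelwise(x, a)
da = prelu_grad_a(x, np.ones_like(x))
```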

Results

- Based on the learnable activation and advanced initialization, the authors achieve 4.94% top-5 test error on the ImageNet 2012 classification dataset.
- The authors' result is 1.7% (absolute) better than the ILSVRC 2014 winner (GoogLeNet, 6.66% [33]), a 26% relative improvement
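The "advanced initialization" referred to here is the paper's variance-preserving scheme for rectifier networks, commonly called He initialization: zero-mean Gaussian weights with standard deviation sqrt(2/fan_in) for ReLU (generalized to sqrt(2/((1 + a²)·fan_in)) for PReLU with slope a). A small numpy sketch under that reading:

```python
import numpy as np

def he_normal(fan_in, shape, rng):
    # std = sqrt(2 / fan_in) keeps the variance of activations roughly
    # constant from layer to layer when the nonlinearity is ReLU.
    return rng.normal(0.0, np.sqrt(2.0 / fan_in), size=shape)

rng = np.random.default_rng(0)
# e.g. a 3x3 conv with 64 input channels: fan_in = 64 * 3 * 3 = 576
w = he_normal(576, (128, 64, 3, 3), rng)
```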

Conclusion

**Discussion on Rectifiers**

- The analysis in Sec. 2.1 and 2.2 involves the “rectified” units, which are asymmetric activation functions, unlike many activations that are symmetric.
- This leads to some fundamental differences.
- The conclusions involving Eqn(6) and Eqn(9) are heavily biased by the fact that E[f ] is greater than zero in the case of ReLU.
- The asymmetric behavior requires algorithmic changes that take this effect into account.
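The point about E[f] > 0 can be checked numerically: for a zero-mean Gaussian input, the mean of the ReLU outputs is strictly positive (a quick numpy check, not from the paper):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(0.0, 1.0, size=1_000_000)  # zero-mean, unit-variance input
relu_mean = np.maximum(x, 0.0).mean()
# For a standard normal input the exact value is 1/sqrt(2*pi) ~ 0.3989,
# strictly positive: a ReLU layer shifts the mean of its outputs upward.
```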


- Table 1: A small but deep 14-layer model [11]. The filter size and filter number of each layer are listed. The number /s indicates the stride s that is used. The learned coefficients of PReLU are also shown; for the channel-wise case, the average of {ai} over the channels is shown for each layer
- Table 2: Comparisons between ReLU, LReLU, and PReLU on the small model. The error rates are for ImageNet 2012 using 10-view testing. The images are resized so that the shorter side is 256, during both training and testing. Each view is 224×224. All models are trained using 75 epochs
- Table 3: Architectures of large models. Here “/2” denotes a stride of 2. The “spp” layer [12] produces a 4-level {7, 3, 2, 1} pyramid. The complexity (comp.) is measured in 10^10 operations
- Table 4: Comparisons between ReLU/PReLU on model A in ImageNet 2012 using dense testing
- Table 5: Single-model 10-view results for the ImageNet 2012 val set. †: based on our tests
- Table 6: Single-model results for the ImageNet 2012 val set. †: evaluated from the test set
- Table 7: Multi-model results for the ImageNet 2012 test set
- Table 8: Object detection mAP on PASCAL VOC 2007 using Fast R-CNN [6] on different pre-trained nets

Contributions

- Studies rectifier neural networks for image classification from two aspects
- Achieves 4.94% top-5 test error on the ImageNet 2012 classification dataset
- Presents a result that surpasses the human-level performance reported by [26] on a more generic and challenging recognition task - the classification task in the 1000-class ImageNet dataset
- Investigates neural networks from two aspects driven by the rectifier properties
- Shows that replacing the parameter-free ReLU by a learned activation unit improves classification accuracy

References

- F. Agostinelli, M. Hoffman, P. Sadowski, and P. Baldi. Learning activation functions to improve deep neural networks. arXiv:1412.6830, 2014.
- K. Chatfield, K. Simonyan, A. Vedaldi, and A. Zisserman. Return of the devil in the details: Delving deep into convolutional nets. In BMVC, 2014.
- D. Ciresan, U. Meier, and J. Schmidhuber. Multi-column deep neural networks for image classification. In CVPR, 2012.
- J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei. ImageNet: A large-scale hierarchical image database. In CVPR, 2009.
- M. Everingham, L. Van Gool, C. K. Williams, J. Winn, and A. Zisserman. The Pascal Visual Object Classes (VOC) Challenge. IJCV, pages 303–338, 2010.
- R. Girshick. Fast R-CNN. arXiv:1504.08083, 2015.
- R. Girshick, J. Donahue, T. Darrell, and J. Malik. Rich feature hierarchies for accurate object detection and semantic segmentation. In CVPR, 2014.
- X. Glorot and Y. Bengio. Understanding the difficulty of training deep feedforward neural networks. In International Conference on Artificial Intelligence and Statistics, 2010.
- X. Glorot, A. Bordes, and Y. Bengio. Deep sparse rectifier networks. In Proceedings of the 14th International Conference on Artificial Intelligence and Statistics, pages 315–323, 2011.
- I. J. Goodfellow, D. Warde-Farley, M. Mirza, A. Courville, and Y. Bengio. Maxout networks. arXiv:1302.4389, 2013.
- K. He and J. Sun. Convolutional neural networks at constrained time cost. arXiv:1412.1710, 2014.
- K. He, X. Zhang, S. Ren, and J. Sun. Spatial pyramid pooling in deep convolutional networks for visual recognition. arXiv:1406.4729v2, 2014.
- G. E. Hinton, N. Srivastava, A. Krizhevsky, I. Sutskever, and R. R. Salakhutdinov. Improving neural networks by preventing co-adaptation of feature detectors. arXiv:1207.0580, 2012.
- A. G. Howard. Some improvements on deep convolutional neural network based image classification. arXiv:1312.5402, 2013.
- S. Ioffe and C. Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. arXiv:1502.03167, 2015.
- Y. Jia, E. Shelhamer, J. Donahue, S. Karayev, J. Long, R. Girshick, S. Guadarrama, and T. Darrell. Caffe: Convolutional architecture for fast feature embedding. arXiv:1408.5093, 2014.
- A. Krizhevsky. One weird trick for parallelizing convolutional neural networks. arXiv:1404.5997, 2014.
- A. Krizhevsky, I. Sutskever, and G. Hinton. Imagenet classification with deep convolutional neural networks. In NIPS, 2012.
- Y. LeCun, B. Boser, J. S. Denker, D. Henderson, R. E. Howard, W. Hubbard, and L. D. Jackel. Backpropagation applied to handwritten zip code recognition. Neural computation, 1989.
- Y. LeCun, L. Bottou, G. B. Orr, and K.-R. Muller. Efficient backprop. In Neural Networks: Tricks of the Trade, pages 9–50.
- C.-Y. Lee, S. Xie, P. Gallagher, Z. Zhang, and Z. Tu. Deeply-supervised nets. arXiv:1409.5185, 2014.
- M. Lin, Q. Chen, and S. Yan. Network in network. arXiv:1312.4400, 2013.
- A. L. Maas, A. Y. Hannun, and A. Y. Ng. Rectifier nonlinearities improve neural network acoustic models. In ICML, 2013.
- V. Nair and G. E. Hinton. Rectified linear units improve restricted boltzmann machines. In ICML, pages 807–814, 2010.
- T. Raiko, H. Valpola, and Y. LeCun. Deep learning made easier by linear transformations in perceptrons. In International Conference on Artificial Intelligence and Statistics, 2012.
- O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein, et al. Imagenet large scale visual recognition challenge. arXiv:1409.0575, 2014.
- A. M. Saxe, J. L. McClelland, and S. Ganguli. Exact solutions to the nonlinear dynamics of learning in deep linear neural networks. arXiv:1312.6120, 2013.
- P. Sermanet, D. Eigen, X. Zhang, M. Mathieu, R. Fergus, and Y. LeCun. Overfeat: Integrated recognition, localization and detection using convolutional networks. 2014.
- K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition. arXiv:1409.1556, 2014.
- N. Srivastava, G. Hinton, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov. Dropout: A simple way to prevent neural networks from overfitting. The Journal of Machine Learning Research, pages 1929–1958, 2014.
- R. K. Srivastava, J. Masci, S. Kazerounian, F. Gomez, and J. Schmidhuber. Compete to compute. In NIPS, pages 2310–2318, 2013.
- Y. Sun, Y. Chen, X. Wang, and X. Tang. Deep learning face representation by joint identification-verification. In NIPS, 2014.
- C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and A. Rabinovich. Going deeper with convolutions. arXiv:1409.4842, 2014.
- Y. Taigman, M. Yang, M. Ranzato, and L. Wolf. Deepface: Closing the gap to human-level performance in face verification. In CVPR, 2014.
- T. Vatanen, T. Raiko, H. Valpola, and Y. LeCun. Pushing stochastic gradient towards second-order methods - backpropagation learning with transformations in nonlinearities. In Neural Information Processing, pages 442–449.
- L. Wan, M. Zeiler, S. Zhang, Y. L. Cun, and R. Fergus. Regularization of neural networks using dropconnect. In ICML, pages 1058–1066, 2013.
- M. D. Zeiler and R. Fergus. Visualizing and understanding convolutional neural networks. In ECCV, 2014.
- M. D. Zeiler, M. Ranzato, R. Monga, M. Mao, K. Yang, Q. V. Le, P. Nguyen, A. Senior, V. Vanhoucke, J. Dean, and G. E. Hinton. On rectified linear units for speech processing. In ICASSP, 2013.
