Group Normalization

    International Journal of Computer Vision, pp. 742-755, 2020.

    Keywords: Layer Normalization, deep network, Batch Renormalization, Local Response Normalization, neural network

    Abstract:

    Batch Normalization (BN) is a milestone technique in the development of deep learning, enabling various networks to train. However, normalizing along the batch dimension introduces problems: BN's error increases rapidly when the batch size becomes smaller, caused by inaccurate batch statistics estimation. This limits BN's usage for training larger models and transferring features to tasks such as detection, segmentation, and video recognition, which require small batches constrained by memory consumption. The paper presents Group Normalization (GN), which divides the channels into groups and computes within each group the mean and variance for normalization, making the computation independent of batch sizes.

    Introduction
    • Batch Normalization (Batch Norm or BN) [1] has been established as a very effective component in deep learning, largely helping push the frontier in computer vision [2,3] and beyond [4].
    • BN normalizes the features by the mean and variance computed within a batch.
    • This has been shown by many practices to ease optimization and enable very deep networks to converge.
    • The stochastic uncertainty of the batch statistics acts as a regularizer that can benefit generalization.
    • It is required for BN to work with a sufficiently large batch size (e.g., 32 per worker [1,2,3]).
    • A small batch leads to inaccurate estimation of the batch statistics, and reducing BN’s batch size increases the model error dramatically (Figure 1); a sketch contrasting batch-wise and group-wise statistics follows this list
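
      A minimal NumPy sketch (not the paper's reference code) of the two kinds of statistics discussed above, assuming NCHW feature maps; the group count of 32 and the epsilon follow common defaults and are assumptions here, and the learnable per-channel scale and shift (γ, β) are omitted:

      import numpy as np

      def batch_norm_stats(x):
          # x: (N, C, H, W). BN uses one mean/var per channel computed over
          # (N, H, W), so the estimate degrades as the batch size N shrinks.
          return x.mean(axis=(0, 2, 3)), x.var(axis=(0, 2, 3))

      def group_norm(x, num_groups=32, eps=1e-5):
          # x: (N, C, H, W). GN splits the C channels into groups and computes
          # mean/var per sample and per group, independently of the batch size N.
          n, c, h, w = x.shape
          xg = x.reshape(n, num_groups, c // num_groups, h, w)
          mean = xg.mean(axis=(2, 3, 4), keepdims=True)
          var = xg.var(axis=(2, 3, 4), keepdims=True)
          xg = (xg - mean) / np.sqrt(var + eps)
          return xg.reshape(n, c, h, w)

      x = np.random.randn(2, 64, 8, 8)   # a batch as small as 2 is fine for GN
      y = group_norm(x)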
    Highlights
    • Batch Normalization (Batch Norm or BN) [1] has been established as a very effective component in deep learning, largely helping push the frontier in computer vision [2,3] and beyond [4]
    • To study Group Norm/Batch Norm compared to no normalization, we consider VGG-16 [57] that can be healthily trained without normalization layers
    • Group Norm improves over Batch Norm* by 1.1 box Average Precision and 0.8 mask Average Precision
    • On the contrary, applying Batch Norm to the box head does not give a satisfactory result and is ∼9 Average Precision worse: in detection, the batch of RoIs is sampled from the same image, so their distribution is not i.i.d., and this non-i.i.d. distribution degrades Batch Norm’s batch statistics estimation [35]
    • We have presented Group Norm as an effective normalization layer without exploiting the batch dimension
    • On ResNet-50 trained in ImageNet, Group Norm has 10.6% lower error than its Batch Norm counterpart when using a batch size of 2; when using typical batch sizes, Group Norm is comparably good with Batch Norm and outperforms other normalization variants
    • Batch Norm has been so influential that many state-of-the-art systems and their hyper-parameters have been designed for it, which may not be optimal for Group Norm-based models
    Methods
    • 4.1 Image Classification in ImageNet. Implementation details.
    • As standard practice [3,52], the authors use 8 GPUs to train all models, and the batch mean and variance of BN are computed within each GPU.
    • The authors use 1 to initialize all γ parameters, except for each residual block’s last normalization layer where the authors initialize γ by 0 following [54].
    • The authors train 100 epochs for all models, and decrease the learning rate by 10× at 30, 60, and 90 epochs (see the schedule sketch after this list).
    • Other implementation details follow [52]
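
      A small sketch of the stepwise schedule described in this list; the 10× drops at epochs 30, 60, and 90 follow the recipe above, while the base learning rate of 0.1 is an assumption not stated in this summary:

      def step_lr(epoch, base_lr=0.1, milestones=(30, 60, 90), drop=0.1):
          # Decrease the learning rate by 10x at epochs 30, 60, and 90.
          lr = base_lr
          for m in milestones:
              if epoch >= m:
                  lr *= drop
          return lr

      # 100-epoch schedule: 0.1 for epochs 0-29, then 0.01, 0.001, 0.0001
      schedule = [step_lr(e) for e in range(100)]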
    Results
    • Results and analysis of VGG models

      To study GN/BN compared to no normalization, the authors consider VGG-16 [57] that can be healthily trained without normalization layers.
    • Table 4 shows the comparison of GN vs BN* on Mask R-CNN using a conv4 backbone (“C4” [10])
    • This C4 variant uses ResNet’s layers of up to conv4 to extract feature maps, and ResNet’s conv5 layers as the Region-of-Interest (RoI) heads for classification and regression.
    • As they are inherited from the pre-trained model, the backbone and head both involve normalization layers
    • On this baseline, GN improves over BN* by 1.1 box AP and 0.8 mask AP.
    Conclusion
    • The authors have presented GN as an effective normalization layer without exploiting the batch dimension.
    • The authors have evaluated GN’s behaviors in a variety of applications.
    • BN has been so influential that many state-of-the-art systems and their hyper-parameters have been designed for it, which may not be optimal for GN-based models.
    • It is possible that re-designing these systems or searching for new hyper-parameters for GN will give better results
    Tables
    • Table1: Comparison of error rates with a batch size of 32 images/GPU, on ResNet-50 in the ImageNet validation set. The error curves are in Figure 4
    • Table2: Sensitivity to batch sizes. We show ResNet-50’s validation error (%) in ImageNet. The last row shows the differences between BN and GN. The error curves are in Figure 5. This table is visualized in Figure 1
    • Table3: Group division. We show ResNet-50’s validation error (%) in ImageNet, trained with 32 images/GPU. (Left): a given number of groups. (Right): a given number of channels per group. The last rows show the differences with the best number
    • Table4: Detection and segmentation results in COCO, using Mask R-CNN with the ResNet-50 C4 backbone. BN* means BN is frozen
    • Table5: Detection and segmentation results in COCO, using Mask R-CNN with ResNet-50 FPN and a 4conv1fc bounding box head. BN* means BN is frozen
    • Table6: Detection and segmentation results in COCO using Mask R-CNN and FPN. Here BN* is the default Detectron baseline [59], and GN is applied to the backbone, box head, and mask head. “long” means training with more iterations
    • Table7: COCO models trained from scratch using Mask R-CNN and FPN
    • Table8: Video classification in Kinetics: ResNet-50 I3D’s top-1/5 accuracy (%)
    Related work
    • Normalization. Normalization layers in deep networks had been widely used before the development of BN. Local Response Normalization (LRN) [26,27,28] was a component in AlexNet [28] and following models [29,30,31]. LRN computes the statistics in a small neighborhood for each pixel.

      Batch Normalization [1] performs more global normalization along the batch dimension (and, as importantly, it suggests doing this for all layers). But the concept of “batch” is not always present, or it may change from time to time. For example, batch-wise normalization is not legitimate at inference time, so the mean and variance are pre-computed from the training set [1], often by a running average; consequently, no batch statistics are computed when testing. The pre-computed statistics may also change when the target data distribution changes [32]. These issues lead to inconsistency at training, transferring, and testing time. In addition, as mentioned above, reducing the batch size can have a dramatic impact on the estimated batch statistics.
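
      To make this train/test inconsistency concrete, the following toy sketch (an illustration, not the paper's implementation) shows how BN typically keeps a running average of the batch statistics during training and freezes them at inference, where batch-wise normalization is not legitimate; the momentum value is an assumption, and the learnable affine transform is omitted:

      import numpy as np

      class ToyBatchNorm:
          # BN over (N, C, H, W) features, without the learnable scale and shift.
          def __init__(self, channels, momentum=0.1, eps=1e-5):
              self.running_mean = np.zeros((1, channels, 1, 1))
              self.running_var = np.ones((1, channels, 1, 1))
              self.momentum, self.eps = momentum, eps

          def __call__(self, x, training):
              if training:
                  mean = x.mean(axis=(0, 2, 3), keepdims=True)
                  var = x.var(axis=(0, 2, 3), keepdims=True)
                  # pre-compute statistics for test time by a running average
                  self.running_mean += self.momentum * (mean - self.running_mean)
                  self.running_var += self.momentum * (var - self.running_var)
              else:
                  # inference: frozen statistics, no batch-wise normalization
                  mean, var = self.running_mean, self.running_var
              return (x - mean) / np.sqrt(var + self.eps)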
    Reference
    • Ioffe, S., Szegedy, C.: Batch normalization: Accelerating deep network training by reducing internal covariate shift. In: ICML (2015)
    • Szegedy, C., Vanhoucke, V., Ioffe, S., Shlens, J., Wojna, Z.: Rethinking the inception architecture for computer vision. In: CVPR (2016)
    • He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: CVPR (2016)
    • Silver, D., Schrittwieser, J., Simonyan, K., Antonoglou, I., Huang, A., Guez, A., Hubert, T., Baker, L., Lai, M., Bolton, A., Chen, Y., Lillicrap, T., Hui, F., Sifre, L., van den Driessche, G., Graepel, T., Hassabis, D.: Mastering the game of go without human knowledge. Nature (2017)
    • Szegedy, C., Ioffe, S., Vanhoucke, V.: Inception-v4, Inception-ResNet and the impact of residual connections on learning. In: ICLR Workshop (2016)
    • Huang, G., Liu, Z., van der Maaten, L., Weinberger, K.Q.: Densely connected convolutional networks. In: CVPR (2017)
    • Xie, S., Girshick, R., Dollar, P., Tu, Z., He, K.: Aggregated residual transformations for deep neural networks. In: CVPR (2017)
    • Girshick, R.: Fast R-CNN. In: ICCV (2015)
    • Ren, S., He, K., Girshick, R., Sun, J.: Faster R-CNN: Towards real-time object detection with region proposal networks. In: NIPS (2015)
    • He, K., Gkioxari, G., Dollar, P., Girshick, R.: Mask R-CNN. In: ICCV (2017)
    • Long, J., Shelhamer, E., Darrell, T.: Fully convolutional networks for semantic segmentation. In: CVPR (2015)
    • Tran, D., Bourdev, L., Fergus, R., Torresani, L., Paluri, M.: Learning spatiotemporal features with 3D convolutional networks. In: ICCV (2015)
    • Carreira, J., Zisserman, A.: Quo vadis, action recognition? A new model and the Kinetics dataset. In: CVPR (2017)
    • Lowe, D.G.: Distinctive image features from scale-invariant keypoints. IJCV (2004)
    • Dalal, N., Triggs, B.: Histograms of oriented gradients for human detection. In: CVPR (2005)
    • Russakovsky, O., Deng, J., Su, H., Krause, J., Satheesh, S., Ma, S., Huang, Z., Karpathy, A., Khosla, A., Bernstein, M., Berg, A.C., Fei-Fei, L.: ImageNet Large Scale Visual Recognition Challenge. IJCV (2015)
    • Ba, J.L., Kiros, J.R., Hinton, G.E.: Layer normalization. arXiv:1607.06450 (2016)
    • Ulyanov, D., Vedaldi, A., Lempitsky, V.: Instance normalization: The missing ingredient for fast stylization. arXiv:1607.08022 (2016)
    • Salimans, T., Kingma, D.P.: Weight normalization: A simple reparameterization to accelerate training of deep neural networks. In: NIPS (2016)
    • Lin, T.Y., Maire, M., Belongie, S., Hays, J., Perona, P., Ramanan, D., Dollar, P., Zitnick, C.L.: Microsoft COCO: Common objects in context. In: ECCV (2014)
    • Kay, W., Carreira, J., Simonyan, K., Zhang, B., Hillier, C., Vijayanarasimhan, S., Viola, F., Green, T., Back, T., Natsev, P., et al.: The Kinetics human action video dataset. arXiv:1705.06950 (2017)
    • Rumelhart, D.E., Hinton, G.E., Williams, R.J.: Learning representations by back-propagating errors. Nature (1986)
    • Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Computation (1997)
    • Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., Bengio, Y.: Generative adversarial nets. In: NIPS (2014)
    • Isola, P., Zhu, J.Y., Zhou, T., Efros, A.A.: Image-to-image translation with conditional adversarial networks. In: CVPR (2017)
    • Lyu, S., Simoncelli, E.P.: Nonlinear image representation using divisive normalization. In: CVPR (2008)
    • Jarrett, K., Kavukcuoglu, K., LeCun, Y., et al.: What is the best multi-stage architecture for object recognition? In: ICCV (2009)
    • Krizhevsky, A., Sutskever, I., Hinton, G.: ImageNet classification with deep convolutional neural networks. In: NIPS (2012)
    • Zeiler, M.D., Fergus, R.: Visualizing and understanding convolutional neural networks. In: ECCV (2014)
    • Sermanet, P., Eigen, D., Zhang, X., Mathieu, M., Fergus, R., LeCun, Y.: OverFeat: Integrated recognition, localization and detection using convolutional networks. In: ICLR (2014)
    • Szegedy, C., Liu, W., Jia, Y., Sermanet, P., Reed, S., Anguelov, D., Erhan, D., Vanhoucke, V., Rabinovich, A.: Going deeper with convolutions. In: CVPR (2015)
    • Rebuffi, S.A., Bilen, H., Vedaldi, A.: Learning multiple visual domains with residual adapters. In: NIPS (2017)
    • Arpit, D., Zhou, Y., Kota, B., Govindaraju, V.: Normalization propagation: A parametric technique for removing internal covariate shift in deep networks. In: ICML (2016)
    • Ren, M., Liao, R., Urtasun, R., Sinz, F.H., Zemel, R.S.: Normalizing the normalizers: Comparing and extending network normalization schemes. In: ICLR (2017)
    • Ioffe, S.: Batch renormalization: Towards reducing minibatch dependence in batch-normalized models. In: NIPS (2017)
    • Peng, C., Xiao, T., Li, Z., Jiang, Y., Zhang, X., Jia, K., Yu, G., Sun, J.: MegDet: A large mini-batch object detector. In: CVPR (2018)
    • Dean, J., Corrado, G., Monga, R., Chen, K., Devin, M., Mao, M., Senior, A., Tucker, P., Yang, K., Le, Q.V., et al.: Large scale distributed deep networks. In: NIPS (2012)
    • Howard, A.G., Zhu, M., Chen, B., Kalenichenko, D., Wang, W., Weyand, T., Andreetto, M., Adam, H.: MobileNets: Efficient convolutional neural networks for mobile vision applications. arXiv:1704.04861 (2017)
    • Chollet, F.: Xception: Deep learning with depthwise separable convolutions. In: CVPR (2017)
    • Zhang, X., Zhou, X., Lin, M., Sun, J.: ShuffleNet: An extremely efficient convolutional neural network for mobile devices. In: CVPR (2018)
    • Oliva, A., Torralba, A.: Modeling the shape of the scene: A holistic representation of the spatial envelope. IJCV (2001)
    • Jegou, H., Douze, M., Schmid, C., Perez, P.: Aggregating local descriptors into a compact image representation. In: CVPR (2010)
    • Perronnin, F., Dance, C.: Fisher kernels on visual vocabularies for image categorization. In: CVPR (2007)
    • Dieleman, S., De Fauw, J., Kavukcuoglu, K.: Exploiting cyclic symmetry in convolutional neural networks. In: ICML (2016)
    • Cohen, T., Welling, M.: Group equivariant convolutional networks. In: ICML (2016)
    • Heeger, D.J.: Normalization of cell responses in cat striate cortex. Visual Neuroscience (1992)
    • Schwartz, O., Simoncelli, E.P.: Natural signal statistics and sensory gain control. Nature Neuroscience (2001)
    • Simoncelli, E.P., Olshausen, B.A.: Natural image statistics and neural representation. Annual Review of Neuroscience (2001)
    • Carandini, M., Heeger, D.J.: Normalization as a canonical neural computation. Nature Reviews Neuroscience (2012)
    • Paszke, A., Gross, S., Chintala, S., Chanan, G., Yang, E., DeVito, Z., Lin, Z., Desmaison, A., Antiga, L., Lerer, A.: Automatic differentiation in PyTorch. (2017)
    • Abadi, M., Barham, P., Chen, J., Chen, Z., Davis, A., Dean, J., Devin, M., Ghemawat, S., Irving, G., Isard, M., et al.: TensorFlow: A system for large-scale machine learning. In: Operating Systems Design and Implementation (OSDI) (2016)
    • Gross, S., Wilber, M.: Training and investigating Residual Nets. https://github.com/facebook/fb.resnet.torch (2016)
    • He, K., Zhang, X., Ren, S., Sun, J.: Delving deep into rectifiers: Surpassing human-level performance on ImageNet classification. In: ICCV (2015)
    • Goyal, P., Dollar, P., Girshick, R., Noordhuis, P., Wesolowski, L., Kyrola, A., Tulloch, A., Jia, Y., He, K.: Accurate, large minibatch SGD: Training ImageNet in 1 hour. arXiv:1706.02677 (2017)
    • Krizhevsky, A.: One weird trick for parallelizing convolutional neural networks. arXiv:1404.5997 (2014)
    • Bottou, L., Curtis, F.E., Nocedal, J.: Optimization methods for large-scale machine learning. arXiv:1606.04838 (2016)
    • Simonyan, K., Zisserman, A.: Very deep convolutional networks for large-scale image recognition. In: ICLR (2015)
    • Lin, T.Y., Goyal, P., Girshick, R., He, K., Dollar, P.: Focal loss for dense object detection. In: ICCV (2017)
    • Girshick, R., Radosavovic, I., Gkioxari, G., Dollar, P., He, K.: Detectron. https://github.com/facebookresearch/detectron (2018)
    • Lin, T.Y., Dollar, P., Girshick, R., He, K., Hariharan, B., Belongie, S.: Feature pyramid networks for object detection. In: CVPR (2017)
    • Ren, S., He, K., Girshick, R., Zhang, X., Sun, J.: Object detection networks on convolutional feature maps. TPAMI (2017)
    • Li, Z., Peng, C., Yu, G., Zhang, X., Deng, Y., Sun, J.: DetNet: A backbone network for object detection. arXiv:1804.06215 (2018)
    • Wang, X., Girshick, R., Gupta, A., He, K.: Non-local neural networks. In: CVPR (2018)