# Channel Equilibrium Networks for Learning Deep Representation

ICML 2020.

Keywords:

- instance reweighting
- residual networks
- Average Precision
- feature representation
- generalization ability

Abstract:

Convolutional Neural Networks (CNNs) are typically constructed by stacking multiple building blocks, each of which contains a normalization layer such as batch normalization (BN) and a rectified linear function such as ReLU. However, this work shows that the combination of normalization and rectified linear function leads to inhibited channels.


Introduction

- Normalization methods such as batch normalization (BN) (Ioffe & Szegedy, 2015), layer normalization (LN) (Ba et al., 2016) and instance normalization (IN) (Ulyanov et al., 2016) are important components for a wide range of tasks such as image classification (Ioffe & Szegedy, 2015), object detection (He et al., 2017a), and image generation (Miyato et al., 2018).
- The lottery ticket hypothesis (Frankle & Carbin, 2018) found that when a CNN is over-parameterized, it always contains unimportant ("dead") channels whose feature values are extremely small.
- Although these inhibited channels could be pruned during training to reduce the model size, pruning them leads to limited generalization ability of the network (Yu et al., 2018; He et al., 2017b).
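As an illustration of the "dead" channel phenomenon described above, the sketch below simulates post-BN features with a strongly negative shift β_c on a few channels and counts those whose mean activation magnitude after ReLU is negligible. The 1e-3 threshold and the whole simulation setup are illustrative assumptions, not the paper's procedure.

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulated post-BN features for 8 channels: standardized activations
# rescaled by gamma and shifted by beta, followed by ReLU.
gamma = np.ones(8)
beta = np.zeros(8)
beta[:3] = -5.0  # beta_c << 0 pushes a channel below zero almost surely

x = rng.standard_normal((256, 8))      # standardized (post-BN) features
y = np.maximum(gamma * x + beta, 0.0)  # rescale, shift, then ReLU

# A channel is flagged "inhibited" when its mean activation magnitude is
# tiny; the 1e-3 threshold is an illustrative choice, not the paper's.
per_channel = np.abs(y).mean(axis=0)
inhibited = int((per_channel < 1e-3).sum())
print(inhibited)  # → 3
```

The three channels with β_c far below zero produce (almost) no positive values, so ReLU zeroes them for every input, which is exactly the inhibited-channel behavior the paper targets.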

Highlights

- Normalization methods such as batch normalization (BN) (Ioffe & Szegedy, 2015), layer normalization (LN) (Ba et al., 2016) and instance normalization (IN) (Ulyanov et al., 2016) are important components for a wide range of tasks such as image classification (Ioffe & Szegedy, 2015), object detection (He et al., 2017a), and image generation (Miyato et al., 2018).
- Supposing that every single channel aims to contribute to the learned feature representation, we show that decorrelating feature channels after the normalization method can be connected to a Nash equilibrium for each instance.
- We presented an effective and efficient network block, termed as Channel Equilibrium (CE)
- We show that Channel Equilibrium encourages channels at the same layer to contribute to learned feature representation, enhancing the generalization ability of the network
- Channel Equilibrium can be stacked between the normalization layer and the rectified units, making it easy to integrate into various convolutional neural network architectures.
- We hope that the analyses of Channel Equilibrium could bring a new perspective for future work in architecture design
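The placement described in the highlights (CE between the normalization layer and the rectified unit) can be sketched as follows. The CE operation itself is left as a hypothetical placeholder (identity here) just to show where it slots in; `batchnorm`, `channel_equilibrium`, and `block` are illustrative names, not the paper's implementation.

```python
import numpy as np

def batchnorm(x, eps=1e-5):
    # Per-channel standardization over the batch axis (affine omitted).
    mu = x.mean(axis=0, keepdims=True)
    var = x.var(axis=0, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def channel_equilibrium(x):
    # Placeholder for the CE operation (decorrelation across channels);
    # identity here, only to mark the insertion point in the block.
    return x

def relu(x):
    return np.maximum(x, 0.0)

def block(x):
    # CE sits between normalization and the rectified unit:
    # norm -> CE -> ReLU (the convolution preceding BN is omitted).
    return relu(channel_equilibrium(batchnorm(x)))

x = np.random.default_rng(0).standard_normal((32, 16))
y = block(x)
print(y.shape)  # (32, 16)
```

Because CE only consumes and produces a tensor of the same shape at this position, swapping it into an existing norm+ReLU block leaves the rest of the architecture unchanged, which is what makes the design portable across CNNs.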

Methods

- The variance scale s in Eqn. (8) and the covariance Σ are computed within each mini-batch at every training step.
- To make the output depend deterministically on the input at inference, the authors use a moving average to maintain population estimates of these statistics, where m denotes the momentum of the moving average.
- It is worth noting that since the estimates are fixed during inference, the branch introduces no extra memory or computation cost beyond a simple linear transformation.
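The moving-average bookkeeping described above can be sketched as follows, assuming the standard exponential-moving-average update convention; the particular batch variance scale computed here is illustrative, not the paper's Eqn. (8).

```python
import numpy as np

rng = np.random.default_rng(0)

# Running (population) estimates of the variance scale s and the
# covariance Sigma, updated with momentum m at every training step.
m = 0.9
running_s = 0.0
running_sigma = np.zeros((4, 4))

for step in range(100):
    batch = rng.standard_normal((64, 4))
    s = batch.var(axis=0).mean()         # mini-batch variance scale (one choice)
    sigma = np.cov(batch, rowvar=False)  # mini-batch covariance
    running_s = m * running_s + (1 - m) * s
    running_sigma = m * running_sigma + (1 - m) * sigma

# At inference these fixed estimates replace the per-batch statistics,
# so the output depends only on the input.
print(running_sigma.shape)  # (4, 4)
```

Since the running estimates are frozen at test time, the whole branch collapses into a fixed linear transformation, which is why it adds essentially no inference cost.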

Conclusion

- The authors presented an effective and efficient network block, termed as Channel Equilibrium (CE).
- The authors show that CE encourages channels at the same layer to contribute to learned feature representation, enhancing the generalization ability of the network.
- CE can be stacked between the normalization layer and the rectified units, making it easy to integrate into various CNN architectures.
- The authors hope that the analyses of CE could bring a new perspective for future work in architecture design.

Summary


- Table1: Comparisons with baseline and SENet on ResNet-18, -50, and -101 in terms of accuracy, GFLOPs, and CPU and GPU inference time on ImageNet. The top-1/-5 accuracy of our CE-ResNet is higher than SE-ResNet, while the computational cost in terms of GFLOPs and GPU/CPU inference time remains nearly the same
- Table2: Comparisons with baseline and SE on lightweight networks, MobileNetv2 and ShuffleNetv2, in terms of accuracy and GFLOPs on ImageNet. Our CENet improves the top-1 accuracy by a large margin compared with SENet with nearly the same GFLOPs
- Table3: Detection and segmentation results on COCO using Mask R-CNN. We use the pretrained CE-ResNet50 (78.3) and CE-ResNet101 (79.0) models from ImageNet to train our model. CENet can consistently improve both box AP and segmentation AP by a large margin
- Table4: Ratios of channels with βc ≤ 0 after training on various CNNs
- Table5: CE improves top-1 and top-5 accuracy of various normalization methods and rectified units on ImageNet with ResNet50 or ResNet18 as backbones
- Table6: Results of BD, IR and CE on ImageNet with ResNet-50 as the basic structure. The top-1 accuracy increase (1.7) of CE-ResNet is higher than the combined top-1 accuracy increase (1.1) of BD-ResNet and IR-ResNet, indicating the effects of the BD and IR branches are complementary
- Table7: Comparison between the proposed CE and other normalization methods using decorrelation on the ImageNet dataset. CE achieves higher top-1 accuracy on both ResNet50 and ResNet18
- Table8: We add CE after the second (CE2-ResNet50) and third (CE3-ResNet50) batch normalization layer in each residual block. The third batch normalization has 4 times the channels of the second one, and the top-1 accuracy of CE3-ResNet50 outperforms CE2-ResNet50 by 0.4, which indicates CE benefits from a larger number of channels

Related work

- Sparsity in ReLU. An attractive property of ReLU (Sun et al., 2015; Nair & Hinton, 2010) is sparsity, which brings potential advantages such as information disentangling and linear separability. However, Lu et al. (2019) and Mehta et al. (2019) pointed out that some ReLU neurons may become inactive and output 0 for any input. Previous work tackled this issue by designing new activation functions, such as ELU (Clevert et al., 2015) and Leaky ReLU (Maas et al., 2013). Recently, Lu et al. (2019) also tried to solve this problem by modifying the initialization scheme. Different from these works, CE focuses on explicitly preventing inhibited channels in a feed-forward way by encouraging channels at the same layer to contribute equally to the learned feature representation.
- Normalization and decorrelation. There are many practices in normalizer development, such as Batch Normalization (BN) (Ioffe & Szegedy, 2015), Group Normalization (GN) (Wu & He, 2018) and Switchable Normalization (Luo et al., 2018). A normalization scheme is typically applied after a convolution layer and contains two stages: standardization and rescaling. Another type of normalization method not only standardizes but also decorrelates features, like DBN (Huang et al., 2018), IterNorm (Huang et al., 2019) and switchable whitening (Pan et al., 2019). Despite their success in stabilizing training, little is explored about the relationship between these methods and inhibited channels. Fig. 1 shows that inhibited channels emerge in VGGNet where ‘BN+ReLU’ or ‘LN+ReLU’ is used. Unlike previous decorrelated normalizations, where the decorrelation operation is applied after a convolution layer, our CE explicitly decorrelates features after normalization and is designed to prevent inhibited channels from emerging in the block of normalization and rectified units.
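To make the decorrelation step concrete, here is a ZCA-whitening sketch in the spirit of DBN/IterNorm, applied after standardization as CE does. The exact operator CE uses differs (it involves the Nash-equilibrium formulation), so this is only an illustrative instance of decorrelating features with Σ^{-1/2}.

```python
import numpy as np

rng = np.random.default_rng(0)

# Correlated toy features, then per-channel standardization (as a
# normalization layer would produce).
x = rng.standard_normal((512, 8)) @ rng.standard_normal((8, 8))
x = (x - x.mean(axis=0)) / x.std(axis=0)

# ZCA whitening: multiply by Sigma^{-1/2}, computed via the
# eigendecomposition of the (symmetric) covariance matrix.
sigma = np.cov(x, rowvar=False)
eigval, eigvec = np.linalg.eigh(sigma)
whiten = eigvec @ np.diag(eigval ** -0.5) @ eigvec.T  # Sigma^{-1/2}
y = x @ whiten

# After whitening, the channel covariance is (numerically) the identity.
cov_y = np.cov(y, rowvar=False)
off_diag = np.abs(cov_y - np.eye(8)).max()
print(off_diag < 1e-6)  # → True
```

ZCA (as opposed to PCA whitening) keeps the whitened features as close as possible to the originals, which is why decorrelated normalizations favor it: channels are decorrelated without being arbitrarily rotated.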

Reference

- Arpit, D., Zhou, Y., Kota, B. U., and Govindaraju, V. Normalization propagation: A parametric technique for removing internal covariate shift in deep networks. arXiv preprint arXiv:1603.01431, 2016.
- Ba, J. L., Kiros, J. R., and Hinton, G. E. Layer normalization. arXiv preprint arXiv:1607.06450, 2016.
- Bini, D. A., Higham, N. J., and Meini, B. Algorithms for the matrix pth root. Numerical Algorithms, 39(4):349–378, 2005.
- Cao, Y., Xu, J., Lin, S., Wei, F., and Hu, H. Gcnet: Non-local networks meet squeeze-excitation networks and beyond. arXiv preprint arXiv:1904.11492, 2019.
- Chen, K., Wang, J., Pang, J., Cao, Y., Xiong, Y., Li, X., Sun, S., Feng, W., Liu, Z., Xu, J., et al. Mmdetection: Open mmlab detection toolbox and benchmark. arXiv preprint arXiv:1906.07155, 2019.
- Clevert, D.-A., Unterthiner, T., and Hochreiter, S. Fast and accurate deep network learning by exponential linear units (elus). arXiv preprint arXiv:1511.07289, 2015.
- Cover, T. M. and Thomas, J. A. Elements of information theory. John Wiley & Sons, 2012.
- Frankle, J. and Carbin, M. The lottery ticket hypothesis: Finding sparse, trainable neural networks. arXiv preprint arXiv:1803.03635, 2018.
- Glorot, X., Bordes, A., and Bengio, Y. Deep sparse rectifier neural networks. In Proceedings of the fourteenth international conference on artificial intelligence and statistics, pp. 315–323, 2011.
- He, K., Zhang, X., Ren, S., and Sun, J. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 770–778, 2016.
- He, K., Gkioxari, G., Dollar, P., and Girshick, R. Mask rcnn. In Proceedings of the IEEE international conference on computer vision, pp. 2961–2969, 2017a.
- He, Y., Zhang, X., and Sun, J. Channel pruning for accelerating very deep neural networks. In Proceedings of the IEEE International Conference on Computer Vision, pp. 1389–1397, 2017b.
- Higham, N. J. Newton’s method for the matrix square root. Mathematics of Computation, 46(174):537–549, 1986.
- Hu, J., Shen, L., and Sun, G. Squeeze-and-excitation networks. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 7132–7141, 2018.
- Huang, L., Yang, D., Lang, B., and Deng, J. Decorrelated batch normalization. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 791–800, 2018.
- Huang, L., Zhou, Y., Zhu, F., Liu, L., and Shao, L. Iterative normalization: Beyond standardization towards efficient whitening. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 4874–4883, 2019.
- Ioffe, S. and Szegedy, C. Batch normalization: Accelerating deep network training by reducing internal covariate shift. arXiv preprint arXiv:1502.03167, 2015.
- Krizhevsky, A. Learning multiple layers of features from tiny images. Technical report, 2009.
- Laufer, A., Leshem, A., and Messer, H. Game theoretic aspects of distributed spectral coordination with application to dsl networks. arXiv preprint cs/0602014, 2006.
- Leshem, A. and Zehavi, E. Game theory and the frequency selective interference channel. IEEE Signal Processing Magazine, 26(5):28–40, 2009.
- Lu, L., Shin, Y., Su, Y., and Karniadakis, G. E. Dying relu and initialization: Theory and numerical examples. arXiv preprint arXiv:1903.06733, 2019.
- Luo, P., Ren, J., Peng, Z., Zhang, R., and Li, J. Differentiable learning-to-normalize via switchable normalization. arXiv preprint arXiv:1806.10779, 2018.
- Ma, N., Zhang, X., Zheng, H.-T., and Sun, J. Shufflenet v2: Practical guidelines for efficient cnn architecture design. In Proceedings of the European Conference on Computer Vision (ECCV), pp. 116–131, 2018.
- Maas, A. L., Hannun, A. Y., and Ng, A. Y. Rectifier nonlinearities improve neural network acoustic models. In Proc. icml, volume 30, pp. 3, 2013.
- Mehta, D., Kim, K. I., and Theobalt, C. On implicit filter level sparsity in convolutional neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 520–528, 2019.
- Miyato, T., Kataoka, T., Koyama, M., and Yoshida, Y. Spectral normalization for generative adversarial networks. arXiv preprint arXiv:1802.05957, 2018.
- Morcos, A. S., Barrett, D. G., Rabinowitz, N. C., and Botvinick, M. On the importance of single directions for generalization. arXiv preprint arXiv:1803.06959, 2018.
- Nair, V. and Hinton, G. E. Rectified linear units improve restricted boltzmann machines. In Proceedings of the 27th international conference on machine learning (ICML-10), pp. 807–814, 2010.
- Osborne, M. J. and Rubinstein, A. A course in game theory. MIT press, 1994.
- Pan, X., Zhan, X., Shi, J., Tang, X., and Luo, P. Switchable whitening for deep representation learning. Proceedings of the IEEE International Conference on Computer Vision, 2019.
- Pecaric, J. Power matrix means and related inequalities. Mathematical Communications, 1(2):91–110, 1996.
- Russakovsky, O., Deng, J., Su, H., Krause, J., Satheesh, S., Ma, S., Huang, Z., Karpathy, A., Khosla, A., Bernstein, M., et al. Imagenet large scale visual recognition challenge. International Journal of Computer Vision, 115(3): 211–252, 2015.
- Sandler, M., Howard, A., Zhu, M., Zhmoginov, A., and Chen, L.-C. Mobilenetv2: Inverted residuals and linear bottlenecks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 4510– 4520, 2018.
- Selvaraju, R. R., Cogswell, M., Das, A., Vedantam, R., Parikh, D., and Batra, D. Grad-cam: Visual explanations from deep networks via gradient-based localization. In Proceedings of the IEEE International Conference on Computer Vision, pp. 618–626, 2017.
- Simonyan, K. and Zisserman, A. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556, 2014.
- Sun, Y., Wang, X., and Tang, X. Deeply learned face representations are sparse, selective, and robust. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 2892–2900, 2015.
- Ulyanov, D., Vedaldi, A., and Lempitsky, V. Instance normalization: The missing ingredient for fast stylization. arXiv preprint arXiv:1607.08022, 2016.
- Wu, Y. and He, K. Group normalization. In Proceedings of the European Conference on Computer Vision (ECCV), pp. 3–19, 2018.
- Yang, J., Ren, Z., Gan, C., Zhu, H., and Parikh, D. Crosschannel communication networks. In Advances in Neural Information Processing Systems, pp. 1295–1304, 2019.
- Yu, J., Yang, L., Xu, N., Yang, J., and Huang, T. Slimmable neural networks. arXiv preprint arXiv:1812.08928, 2018.
- Zhang, C., Bengio, S., Hardt, M., Recht, B., and Vinyals, O. Understanding deep learning requires rethinking generalization. arXiv preprint arXiv:1611.03530, 2016.
