Inducing and Exploiting Activation Sparsity for Fast Inference on Deep Neural Networks

Mark Kurtz
John Carr
Michael Goin
William Leiserson
Sage Moore

ICML 2020.

Keywords:
activation sparsity, memory footprint, sparse input, Rectified Linear Unit, activation function

Abstract:

Optimizing deep neural networks for inference has recently become an extremely active area of research. One of the go-to solutions in this context is weight pruning, which aims to reduce computational and memory footprint by removing large subsets of the connections in a neural network. Surprisingly, much less attention has been given to ...

Introduction
  • Deep neural networks (DNNs) are able to achieve state-of-the-art performance in several application domains, such as image classification, speech recognition, and automated decision making, e.g. (Krizhevsky et al., 2012; Vaswani et al., 2017; Silver et al., 2016).
  • A non-trivial fraction of the activations are zero as a natural consequence of the structure of Rectified Linear Unit (ReLU) activation functions.
  • This observation has been leveraged by hardware accelerators, e.g. (Albericio et al., 2016; Han et al., 2016; Parashar et al., 2017).
  • To make use of activation sparsity at runtime, the authors implement an algorithm to perform sparse convolutions on data that is initially produced in a standard (dense) format.
  • The authors can then apply Algorithm 1 to the compressed input.
  • Both CSR compression and sparse-input convolution can be implemented efficiently on modern hardware, i.e. without the need to branch on zero elements; a minimal sketch of both steps follows this list.
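
To make these two steps concrete, below is a minimal C++ sketch (not the paper's implementation) of CSR-style compression of a single activation row, followed by a convolution that iterates only over the stored nonzeros. It assumes a 1-D, single-channel row and a dense kernel for brevity; the names CsrRow, compress_row, and sparse_conv1d are illustrative.

```cpp
#include <cstddef>
#include <vector>

// Illustrative CSR-style representation of one activation row:
// only the nonzero values and their column indices are stored.
struct CsrRow {
    std::vector<float> values;     // nonzero activation values
    std::vector<int>   cols;       // their positions in the dense row
    int                width = 0;  // original (dense) row width
};

// Compress a dense activation row that was produced in standard format.
// ReLU outputs are exactly zero when clipped, so an equality test suffices.
CsrRow compress_row(const std::vector<float>& dense) {
    CsrRow row;
    row.width = static_cast<int>(dense.size());
    for (int j = 0; j < row.width; ++j) {
        if (dense[j] != 0.0f) {
            row.values.push_back(dense[j]);
            row.cols.push_back(j);
        }
    }
    return row;
}

// "Valid" 1-D convolution that loops only over the stored nonzeros,
// scattering each one into every output position it touches.
// Assumes in.width >= kernel.size().
std::vector<float> sparse_conv1d(const CsrRow& in, const std::vector<float>& kernel) {
    const int k = static_cast<int>(kernel.size());
    std::vector<float> out(static_cast<std::size_t>(in.width - k + 1), 0.0f);
    for (std::size_t n = 0; n < in.values.size(); ++n) {
        const float v = in.values[n];
        const int   j = in.cols[n];
        for (int t = 0; t < k; ++t) {
            const int o = j - t;  // output index this kernel tap contributes to
            if (o >= 0 && o < static_cast<int>(out.size()))
                out[static_cast<std::size_t>(o)] += v * kernel[static_cast<std::size_t>(t)];
        }
    }
    return out;
}
```

The scatter formulation is what lets the inner loop skip zero activations entirely; an optimized implementation would additionally block and vectorize the accumulation into the output map.
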
Highlights
  • Deep neural networks (DNNs) are able to achieve state-of-the-art performance in several application domains, such as image classification, speech recognition, and automated decision making, e.g. (Krizhevsky et al., 2012; Vaswani et al., 2017; Silver et al., 2016).
  • A non-trivial fraction of the activations are zero as a natural consequence of the structure of Rectified Linear Unit (ReLU) activation functions.
  • (Georgiadis, 2019) explored L1 regularization to increase the number of zeroes in the activation maps, showing that sparsity can be increased by up to 60% for image classification models.
  • If we examine the average activation map sparsity across several batches (see the sketch after this list), we notice that layers closer to the input tend to have activation sparsity below the threshold required for a speedup, whereas later layers tend to have higher activation sparsity.
  • We have presented a framework for augmenting and leveraging activation sparsity in deep neural networks for computational speedups.
  • The work closest to ours is (Georgiadis, 2019), which proposed and investigated the use of L1 regularization applied to the activation maps, and showed that it can yield a significant increase in activation sparsity on a range of CNNs for image classification.
  • Our techniques are implemented in an extensible, modular framework, which could be leveraged by researchers wishing to extend our results, both for creating models with higher activation sparsity and for developing faster algorithms for sparse convolutions.
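
For the per-layer measurement mentioned above, the following is a small C++ sketch (with hypothetical function names) of how the fraction of zero entries in an activation map can be computed and averaged over batches.

```cpp
#include <cstddef>
#include <vector>

// Fraction of exactly-zero entries in one (flattened) activation map.
double activation_sparsity(const std::vector<float>& act) {
    if (act.empty()) return 0.0;
    std::size_t zeros = 0;
    for (float a : act)
        if (a == 0.0f) ++zeros;  // ReLU clips negative inputs to exact zero
    return static_cast<double>(zeros) / static_cast<double>(act.size());
}

// Average per-layer sparsity over several batches of activation maps.
double average_sparsity(const std::vector<std::vector<float>>& batches) {
    if (batches.empty()) return 0.0;
    double sum = 0.0;
    for (const auto& b : batches)
        sum += activation_sparsity(b);
    return sum / static_cast<double>(batches.size());
}
```
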
Results
  • The authors implemented the sparse-input convolution in C++, on top of an existing fully-dense baseline framework, which uses optimized direct convolution or general matrix multiply (GeMM) operations for all layers.
  • Figure: (a) input activation map sparsities for ResNet18/ImageNet; (b) layer latencies and speedups for ResNet18/ImageNet (a minimal timing sketch follows this list).
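
The layer latencies and speedups above are averages over repeated forward passes; a hedged sketch of such a timing harness is given below. It is illustrative only: dense_layer and sparse_layer are assumed stand-ins for the GeMM/direct-convolution baseline and the sparse-input variant of a layer.

```cpp
#include <chrono>

// Time a layer implementation over several repetitions and return
// the average latency in milliseconds (illustrative harness only).
template <typename LayerFn>
double time_layer_ms(LayerFn&& layer, int reps = 100) {
    using clock = std::chrono::steady_clock;
    const auto start = clock::now();
    for (int i = 0; i < reps; ++i)
        layer();  // one forward pass of the layer under test
    const std::chrono::duration<double, std::milli> total = clock::now() - start;
    return total.count() / reps;
}

// Usage sketch (dense_layer and sparse_layer are hypothetical callables):
//   double dense_ms  = time_layer_ms(dense_layer);
//   double sparse_ms = time_layer_ms(sparse_layer);
//   double speedup   = dense_ms / sparse_ms;
```
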
Conclusion
  • The authors have presented a framework for augmenting and leveraging activation sparsity in DNNs for computational speedups.
  • The authors' framework leverages two new techniques: on the machine learning side, a set of regularization and thresholding tools to boost the average and peak activation sparsity of networks (a minimal sketch of such a thresholded activation follows this list); on the technical side, an algorithm for efficiently performing convolutions on sparse inputs, along with its optimized implementation in C++.
  • The authors' techniques are implemented in an extensible, modular framework, which could be leveraged by researchers wishing to extend the results, both for creating models with higher activation sparsity and for developing faster algorithms for sparse convolutions.
  • The authors plan to explore additional strategies for memory-bound layers, and to investigate the impact of quantization on sparsity and on computational speedups.
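
As a concrete example of the thresholding side, here is a minimal sketch of a forced-threshold activation of the kind the related-work discussion refers to as FATReLU, assuming the simple form f(x) = x for x > T and 0 otherwise; the paper's exact parameterization and per-layer threshold selection may differ.

```cpp
#include <vector>

// Thresholded ReLU: values at or below the threshold are forced to zero,
// values above it pass through unchanged. Raising the threshold trades a
// small amount of accuracy for higher activation sparsity.
inline float fat_relu(float x, float threshold) {
    return x > threshold ? x : 0.0f;
}

// Apply the thresholded activation in place to one activation map.
void apply_fat_relu(std::vector<float>& act, float threshold) {
    for (float& a : act)
        a = fat_relu(a, threshold);
}
```
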
Tables
  • Table1: Average activation sparsities using different methods
  • Table2: Average inference running times in ms for batch size 64 on various models and variants (AWS C5.24xlarge for CPU and AWS P2.xlarge for GPU). Speedups are shown in brackets relative to the state-of-the-art MXNet/MKL-DNN CPU inference framework.
  • Table3: Average activation sparsity and speedup
Related work
  • The literature on model compression for DNNs is extremely vast, so we restrict our attention to work on analyzing and leveraging activation sparsity. The fact that activation sparsity arises naturally is well known, and has been leveraged by several architecture proposals, e.g. (Albericio et al., 2016; Han et al., 2016; Parashar et al., 2017); in particular, (Rhu et al., 2018) performed an in-depth analysis of activation sparsity on a range of convolutional models. We extend this analysis here.

    Another related line of work is that on compressing activation maps. A common technique for reducing the memory footprint of activation maps is quantization, which has been employed successfully by several references; see e.g. (Mishra et al., 2017) and references therein. We do not investigate quantization here, and leave a thorough treatment of the impact of our sparsification techniques in conjunction with quantization for future work. (Gudovskiy et al., 2018) proposed a projection technique coupled with non-linear dimensionality reduction, which required modifying the network structure, while (Alwani et al., 2016) proposed to stochastically prune activations as an adversarial defense. Both techniques cause significant accuracy loss, and are therefore outside the scope of our study. Agostinelli et al. (2014) propose learning piecewise linear activation functions to improve the accuracy of given models. FATReLU is piecewise linear, but the goals and methods we investigate in this paper are different.
References
  • Agostinelli, F., Hoffman, M., Sadowski, P., and Baldi, P. Learning activation functions to improve deep neural networks. arXiv preprint arXiv:1412.6830, 2014.
  • Albericio, J., Judd, P., Hetherington, T., Aamodt, T., Jerger, N. E., and Moshovos, A. Cnvlutin: Ineffectual-neuronfree deep neural network computing. ACM SIGARCH Computer Architecture News, 44(3):1–13, 2016.
  • Alwani, M., Chen, H., Ferdman, M., and Milder, P. Fusedlayer cnn accelerators. In The 49th Annual IEEE/ACM International Symposium on Microarchitecture, pp. 22. IEEE Press, 2016.
  • Bai, J., Lu, F., Zhang, K., et al. Onnx: Open neural network exchange. https://github.com/onnx/onnx, 2019.
  • Chen, T., Li, M., Li, Y., Lin, M., Wang, N., Wang, M., Xiao, T., Xu, B., Zhang, C., and Zhang, Z. Mxnet: A flexible and efficient machine learning library for heterogeneous distributed systems. arXiv preprint arXiv:1512.01274, 2015.
  • Chen, X. Escort: Efficient sparse convolutional neural networks on gpus. arXiv preprint arXiv:1802.10280, 2018.
  • Deng, J., Dong, W., Socher, R., Li, L.-J., Li, K., and Fei-Fei, L. Imagenet: A large-scale hierarchical image database. In 2009 IEEE conference on computer vision and pattern recognition, pp. 248–255.
  • Dong, X., Liu, L., Li, G., Li, J., Zhao, P., Wang, X., and Feng, X. Exploiting the input sparsity to accelerate deep neural networks: poster. In Hollingsworth, J. K. and Keidar, I. (eds.), Proceedings of the 24th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, PPoPP 2019, Washington, DC, USA, February 16-20, 2019, pp. 401–402. ACM, 2019. ISBN 978-1-4503-6225-2. doi: 10.1145/3293883.3295713. URL https://doi.org/10.1145/3293883.3295713.
  • Frankle, J. and Carbin, M. The lottery ticket hypothesis: Finding sparse, trainable neural networks. arXiv preprint arXiv:1803.03635, 2018.
  • Gale, T., Elsen, E., and Hooker, S. The state of sparsity in deep neural networks. arXiv preprint arXiv:1902.09574, 2019.
  • Gray, S., Radford, A., and Kingma, D. P. Gpu kernels for block-sparse weights. arXiv preprint arXiv:1711.09224, 2017.
  • Gudovskiy, D., Hodgkinson, A., and Rigazio, L. Dnn feature map compression using learned representation over gf (2). In Proceedings of the European Conference on Computer Vision (ECCV), pp. 0–0, 2018.
  • Han, S., Pool, J., Tran, J., and Dally, W. Learning both weights and connections for efficient neural network. In Advances in neural information processing systems, pp. 1135–1143, 2015.
  • Han, S., Liu, X., Mao, H., Pu, J., Pedram, A., Horowitz, M. A., and Dally, W. J. Eie: efficient inference engine on compressed deep neural network. In 2016 ACM/IEEE 43rd Annual International Symposium on Computer Architecture (ISCA), pp. 243–254. IEEE, 2016.
  • He, K., Zhang, X., Ren, S., and Sun, J. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 770–778, 2016.
  • Howard, A. G., Zhu, M., Chen, B., Kalenichenko, D., Wang, W., Weyand, T., Andreetto, M., and Adam, H. Mobilenets: Efficient convolutional neural networks for mobile vision applications. arXiv preprint arXiv:1704.04861, 2017.
  • Hoyer, P. O. Non-negative matrix factorization with sparseness constraints. Journal of machine learning research, 5 (Nov):1457–1469, 2004.
  • Krizhevsky, A., Sutskever, I., and Hinton, G. E. Imagenet classification with deep convolutional neural networks. In Advances in neural information processing systems, pp. 1097–1105, 2012.
  • Krizhevsky, A., Nair, V., and Hinton, G. The cifar-10 dataset. online: http://www.cs.toronto.edu/kriz/cifar.html, 55, 2014.
  • Li, H., Kadav, A., Durdanovic, I., Samet, H., and Graf, H. P. Pruning filters for efficient convnets. arXiv preprint arXiv:1608.08710, 2016.
  • Liu, Z., Li, J., Shen, Z., Huang, G., Yan, S., and Zhang, C. Learning efficient convolutional networks through network slimming. In Proceedings of the IEEE International Conference on Computer Vision, pp. 2736–2744, 2017.
  • Georgiadis, G. Accelerating convolutional neural networks via activation map compression. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 7085–7095, 2019.
  • Luo, J.-H., Wu, J., and Lin, W. Thinet: A filter level pruning method for deep neural network compression. In Proceedings of the IEEE international conference on computer vision, pp. 5058–5066, 2017.
  • Mellempudi, N., Kundu, A., Mudigere, D., Das, D., Kaul, B., and Dubey, P. Ternary neural networks with finegrained quantization. arXiv preprint arXiv:1705.01462, 2017.
  • Mishra, A. K., Nurvitadhi, E., Cook, J. J., and Marr, D. WRPN: wide reduced-precision networks. CoRR, abs/1709.01134, 2017. URL http://arxiv.org/abs/1709.01134.
  • Molchanov, D., Ashukha, A., and Vetrov, D. Variational dropout sparsifies deep neural networks. In Proceedings of the 34th International Conference on Machine Learning-Volume 70, pp. 2498–2507. JMLR. org, 2017.
  • Parashar, A., Rhu, M., Mukkara, A., Puglielli, A., Venkatesan, R., Khailany, B., Emer, J., Keckler, S. W., and Dally, W. J. Scnn: An accelerator for compressed-sparse convolutional neural networks. In 2017 ACM/IEEE 44th Annual International Symposium on Computer Architecture (ISCA), pp. 27–40. IEEE, 2017.
  • Park, J., Li, S., Wen, W., Tang, P. T. P., Li, H., Chen, Y., and Dubey, P. Faster cnns with direct sparse convolutions and guided pruning. arXiv preprint arXiv:1608.01409, 2016a.
  • Park, J., Li, S. R., Wen, W., Li, H., Chen, Y., and Dubey, P. Holistic sparsecnn: Forging the trident of accuracy, speed, and size. arXiv preprint arXiv:1608.01409, 1(2), 2016b.
  • Polino, A., Pascanu, R., and Alistarh, D. Model compression via distillation and quantization. arXiv preprint arXiv:1802.05668, 2018.
  • Rhu, M., O’Connor, M., Chatterjee, N., Pool, J., Kwon, Y., and Keckler, S. W. Compressing dma engine: Leveraging activation sparsity for training deep neural networks. In 2018 IEEE International Symposium on High Performance Computer Architecture (HPCA), pp. 78–91. IEEE, 2018.
  • Silver, D., Huang, A., Maddison, C. J., Guez, A., Sifre, L., Van Den Driessche, G., Schrittwieser, J., Antonoglou, I., Panneershelvam, V., Lanctot, M., et al. Mastering the game of go with deep neural networks and tree search. Nature, 529(7587):484–489, 2016.
  • Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, L., and Polosukhin, I. Attention is all you need. arXiv preprint arXiv:1706.03762, 2017.
  • Wu, X., Wu, Y., and Zhao, Y. High performance binarized neural networks trained on the imagenet classification task. CoRR, abs/1604.03058, 2016. URL http://arxiv.org/abs/1604.03058.
  • Yang, H., Wen, W., and Li, H. Deephoyer: Learning sparser neural network with differentiable scale-invariant sparsity measures. arXiv preprint arXiv:1908.09979, 2019.
  • Zagoruyko, S. and Komodakis, N. Wide Residual Networks. ArXiv e-prints, May 2016.
  • Zhang, H., Li, J., Kara, K., Alistarh, D., Liu, J., and Zhang, C. Zipml: Training linear models with end-to-end low precision, and a little bit of deep learning. In International Conference on Machine Learning, pp. 4035–4043, 2017.
  • Zhu, C., Han, S., Mao, H., and Dally, W. J. Trained ternary quantization. CoRR, abs/1612.01064, 2016. URL http://arxiv.org/abs/1612.01064.
  • Zhu, M. and Gupta, S. To prune, or not to prune: exploring the efficacy of pruning for model compression. arXiv preprint arXiv:1710.01878, 2017.