TRP: Trained Rank Pruning for Efficient Deep Neural Networks

IJCAI, pp. 977-983, 2020.

Keywords:
rank pruning, sub-gradient, gradient descent, low-rank decomposition

Abstract:

To enable DNNs on edge devices like mobile phones, low-rank approximation has been widely adopted because of its solid theoretical rationale and efficient implementations. Several previous works attempted to directly approximate a pretrained model by low-rank decomposition; however, small approximation errors in parameters can ripple over ...

Introduction
  • Deep Neural Networks (DNNs) have shown remarkable success in many computer vision tasks.
  • To enable their deployment on resource-constrained edge devices, many network compression and acceleration methods have been proposed.
  • Different from previous works, this paper proposes a novel approach to designing low-rank networks.
Highlights
  • Deep Neural Networks (DNNs) have shown remarkable success in many computer vision tasks
  • We propose a new method, namely Trained Rank Pruning (TRP), for training low-rank networks
  • We propose a stochastic sub-gradient descent optimized nuclear norm regularization that further constrains the weights to a low-rank space to boost Trained Rank Pruning
  • We proposed a new scheme, Trained Rank Pruning (TRP), for training low-rank networks
  • We propose stochastic sub-gradient descent optimized nuclear norm regularization to boost TRP (a minimal sketch follows this list)
  • The proposed TRP can be incorporated with any low-rank decomposition method
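The nuclear norm ||W||_* is not differentiable everywhere, so the paper optimizes it with stochastic sub-gradient descent [Avron et al., 2012]; a standard sub-gradient at W = U diag(S) V^T is U V^T [Watson, 1992]. The following is a minimal PyTorch-style sketch of that regularization step, assuming λ = 0.0003 (the CIFAR-10 setting reported below) and an out-channels-by-rest matricization of each convolution kernel; the function names are illustrative, not the authors' released code.

```python
import torch

def nuclear_norm_subgradient(mat: torch.Tensor) -> torch.Tensor:
    # For mat = U diag(S) V^T (thin SVD), U @ V^T is a valid sub-gradient
    # of the nuclear norm ||mat||_*.
    U, S, Vh = torch.linalg.svd(mat, full_matrices=False)
    return U @ Vh

def add_nuclear_regularization(model: torch.nn.Module, lam: float = 3e-4) -> None:
    """Add lam * (sub-gradient of ||W||_*) to each conv weight's gradient.

    Intended to be called after loss.backward() and before optimizer.step().
    Each 4-D kernel is reshaped to (out_channels, rest); the paper's
    channel-wise and spatial-wise variants may matricize differently.
    """
    for module in model.modules():
        if isinstance(module, torch.nn.Conv2d):
            w = module.weight
            mat = w.detach().reshape(w.shape[0], -1)
            sub = lam * nuclear_norm_subgradient(mat).reshape_as(w)
            w.grad = sub if w.grad is None else w.grad + sub
```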
Methods
  • The authors implement the TRP scheme on NVIDIA 1080 Ti GPUs. For training on CIFAR-10, they start with a base learning rate of 0.1, train for 164 epochs, and decay the learning rate by a factor of 10 at the 82nd and 122nd epochs (see the schedule sketch after this list).
  • For ImageNet, the authors directly fine-tune the model with the TRP scheme from the pretrained baseline with a learning rate of 0.0001 for 10 epochs.
  • The authors adopt the retrained data-independent decomposition methods as the basic methods
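As a reading aid, the CIFAR-10 schedule above is an ordinary step decay; a minimal sketch of how it could be expressed with PyTorch's MultiStepLR is shown below (the dummy model and optimizer are placeholders, not the authors' training script).

```python
import torch

# Placeholder model and optimizer, only to make the schedule concrete.
model = torch.nn.Linear(10, 10)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1, momentum=0.9)

# 164 epochs in total; multiply the learning rate by 0.1 at epochs 82 and 122.
scheduler = torch.optim.lr_scheduler.MultiStepLR(optimizer, milestones=[82, 122], gamma=0.1)

for epoch in range(164):
    # ... one epoch of TRP training would run here ...
    scheduler.step()
```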
Results
  • The TSVD energy threshold in TRP and TRP+Nu is 0.02, and the nuclear norm weight λ is set to 0.0003 (an energy-based rank-selection sketch follows this list)
  • The authors decompose both the 1 × 1 and 3 × 3 layers in ResNet-56.
  • In the channel-wise decomposition (TRP2) of ResNet-56, TRP combined with nuclear regularization can even achieve twice the speed-up rate of [Zhang et al., 2016] with the same accuracy drop.
  • With the same 2.30× acceleration rate, the performance is better than that of [Zhou et al., 2019]
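The 0.02 energy threshold controls how many singular values the truncated SVD (TSVD) keeps. Below is a minimal sketch of one common energy-based rank-selection rule; whether energy is measured by the singular values or their squares is an assumption here, and truncated_svd is an illustrative name.

```python
import torch

def truncated_svd(mat: torch.Tensor, energy_threshold: float = 0.02):
    """Keep the smallest rank r whose retained energy is at least
    (1 - energy_threshold) of the total, and return the low-rank factors."""
    U, S, Vh = torch.linalg.svd(mat, full_matrices=False)
    energy = S ** 2                                   # assumed energy definition
    cumulative = torch.cumsum(energy, dim=0) / energy.sum()
    r = int((cumulative < 1.0 - energy_threshold).sum().item()) + 1
    return U[:, :r], S[:r], Vh[:r, :]                 # mat ≈ U_r diag(S_r) Vh_r
```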
Conclusion
  • The authors proposed a new scheme, Trained Rank Pruning (TRP), for training low-rank networks.
  • It leverages the capacity and structure of the original network by embedding low-rank approximation in the training process (a training-loop sketch follows this list).
  • The authors propose stochastic sub-gradient descent optimized nuclear norm regularization to boost TRP.
  • The proposed TRP can be incorporated with any low-rank decomposition method.
  • On the CIFAR-10 and ImageNet datasets, the authors have shown that the method outperforms the basic decomposition methods and other pruning-based methods in both channel-wise and spatial-wise decomposition
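Putting the pieces together, a hedged reconstruction of the alternation between gradient updates and periodic low-rank reprojection might look as follows; it reuses the illustrative truncated_svd and add_nuclear_regularization helpers sketched earlier, and the reprojection period and kernel matricization are assumptions rather than the paper's exact procedure.

```python
import torch

def trp_train(model, loader, optimizer, scheduler, epochs=164,
              svd_period=20, energy_threshold=0.02, lam=3e-4):
    """Alternate ordinary SGD steps with periodic low-rank reprojection."""
    criterion = torch.nn.CrossEntropyLoss()
    step = 0
    for epoch in range(epochs):
        for images, labels in loader:
            optimizer.zero_grad()
            loss = criterion(model(images), labels)
            loss.backward()
            add_nuclear_regularization(model, lam)   # nuclear-norm sub-gradient term
            optimizer.step()
            step += 1
            if step % svd_period == 0:
                # Project every conv weight back onto a low-rank matrix via TSVD,
                # so the network converges to a low-rank form during training.
                with torch.no_grad():
                    for m in model.modules():
                        if isinstance(m, torch.nn.Conv2d):
                            w = m.weight
                            mat = w.reshape(w.shape[0], -1)
                            U, S, Vh = truncated_svd(mat, energy_threshold)
                            w.copy_(((U * S) @ Vh).reshape_as(w))
        scheduler.step()
```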
Tables
  • Table 1: Experiment results on CIFAR-10. "R-" indicates ResNet-
  • Table 2: Results of ResNet-18 on ImageNet
  • Table 3: Results of ResNet-50 on ImageNet
  • Table 4: Actual inference time per image on ResNet-18
Related work
  • Many works have been proposed to accelerate the inference of deep neural networks. Briefly, they fall into three main categories: quantization, pruning, and low-rank decomposition.

    Quantization Weight quantization methods include training a quantized model from scratch [Chen et al., 2015; Courbariaux and Bengio, 2016; Rastegari et al., 2016] or converting a pre-trained model into a quantized representation [Zhou et al., 2017; Han et al., 2015a; Xu et al., 2018]. The quantized weight representation includes binary values [Rastegari et al., 2016; Courbariaux and Bengio, 2016] or hash buckets [Chen et al., 2015]. Note that our method is inspired by the scheme of combining quantization with the training process, i.e., we embed low-rank decomposition into the training process to explicitly guide the parameters to a low-rank form.

    Pruning Non-structured and structured sparsity are introduced by pruning. [Han et al., 2015b] proposes to prune unimportant connections with small weights between neural units in a pre-trained CNN. [Wen et al., 2016] utilizes a group Lasso strategy to learn the structured sparsity of networks. [Liu et al., 2017] adopts a similar strategy by explicitly imposing scaling factors on each channel to measure the importance of each connection and dropping those with small weights. In [He et al., 2017], the pruning problem is formulated as a data recovery problem: pre-trained filters are reweighted by minimizing a data recovery objective, and channels with smaller weights are pruned. [Luo et al., 2017] heuristically selects filters using the change of the next layer's output as a criterion.
Funding
  • This work was supported in part by the National Natural Science Foundation of China under Grants 61720106001, 61932022, and in part by the Program of Shanghai Academic Research Leader under Grant 17XD1401900
References
  • [Alvarez and Salzmann, 2017] Jose M Alvarez and Mathieu Salzmann. Compression-aware training of deep networks. In NIPS, 2017.
  • [Avron et al., 2012] Haim Avron, Satyen Kale, Shiva Prasad Kasiviswanathan, and Vikas Sindhwani. Efficient and practical stochastic subgradient descent for nuclear norm regularization. In ICML, 2012.
  • [Chen et al., 2015] Wenlin Chen, James Wilson, Stephen Tyree, Kilian Weinberger, and Yixin Chen. Compressing neural networks with the hashing trick. In ICML, 2015.
  • [Courbariaux and Bengio, 2016] Matthieu Courbariaux and Yoshua Bengio. Binarynet: Training deep neural networks with weights and activations constrained to +1 or -1. arXiv preprint arXiv:1602.02830, 2016.
  • [Deng et al., 2009] Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. Imagenet: A large-scale hierarchical image database. In CVPR, 2009.
  • [Denton et al., 2014] Emily Denton, Wojciech Zaremba, Joan Bruna, Yann Lecun, and Rob Fergus. Exploiting linear structure within convolutional networks for efficient evaluation. In NIPS, 2014.
  • [Guo et al., 2018] Jianbo Guo, Yuxi Li, Weiyao Lin, Yurong Chen, and Jianguo Li. Network decoupling: From regular to depthwise separable convolutions. In BMVC, 2018.
  • [Han et al., 2015a] Song Han, Huizi Mao, and William J. Dally. Deep compression: Compressing deep neural network with pruning, trained quantization and huffman coding. CoRR, abs/1510.00149, 2015.
  • [Han et al., 2015b] Song Han, Jeff Pool, John Tran, and William Dally. Learning both weights and connections for efficient neural network. In NIPS, 2015.
  • [He et al., 2016] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In CVPR, 2016.
  • [He et al., 2017] Yihui He, Xiangyu Zhang, and Jian Sun. Channel pruning for accelerating very deep neural networks. In ICCV, 2017.
  • [Jaderberg et al., 2014] Max Jaderberg, Andrea Vedaldi, and Andrew Zisserman. Speeding up convolutional neural networks with low rank expansions. arXiv preprint arXiv:1405.3866, 2014.
  • [Krizhevsky and Hinton, 2009] Alex Krizhevsky and Geoffrey Hinton. Learning multiple layers of features from tiny images. Technical report, 2009.
  • [Li et al., 2016] Hao Li, Asim Kadav, Igor Durdanovic, Hanan Samet, and Hans Peter Graf. Pruning filters for efficient convnets. arXiv preprint arXiv:1608.08710, 2016.
  • [Li et al., 2017] Hao Li, Soham De, Zheng Xu, Christoph Studer, Hanan Samet, and Tom Goldstein. Training quantized nets: A deeper understanding. In NIPS, 2017.
  • [Liu et al., 2017] Zhuang Liu, Jianguo Li, Zhiqiang Shen, Gao Huang, Shoumeng Yan, and Changshui Zhang. Learning efficient convolutional networks through network slimming. In ICCV, 2017.
  • [Luo et al., 2017] Jian-Hao Luo, Jianxin Wu, and Weiyao Lin. Thinet: A filter level pruning method for deep neural network compression. In ICCV, 2017.
  • [Luo et al., 2018] Jian-Hao Luo, Hao Zhang, Hong-Yu Zhou, Chen-Wei Xie, Jianxin Wu, and Weiyao Lin. Thinet: Pruning cnn filters for a thinner net. TPAMI, 2018.
  • [Ma et al., 2018] Ningning Ma, Xiangyu Zhang, Hai-Tao Zheng, and Jian Sun. Shufflenet v2: Practical guidelines for efficient cnn architecture design. arXiv preprint arXiv:1807.11164, 2018.
  • [Mirsky, 1960] Leon Mirsky. Symmetric gauge functions and unitarily invariant norms. The Quarterly Journal of Mathematics, 11(1):50–59, 1960.
  • [Rastegari et al., 2016] Mohammad Rastegari, Vicente Ordonez, Joseph Redmon, and Ali Farhadi. Xnor-net: Imagenet classification using binary convolutional neural networks. In ECCV, 2016.
  • [Sandler et al., 2018] Mark Sandler, Andrew Howard, Menglong Zhu, Andrey Zhmoginov, and Liang-Chieh Chen. Mobilenetv2: Inverted residuals and linear bottlenecks. In CVPR, 2018.
  • [Stewart, 1990] Gilbert W Stewart. Matrix Perturbation Theory. 1990.
  • [Watson, 1992] G Alistair Watson. Characterization of the subdifferential of some matrix norms. Linear Algebra and its Applications, 170:33–45, 1992.
  • [Wen et al., 2016] Wei Wen, Chunpeng Wu, Yandan Wang, Yiran Chen, and Hai Li. Learning structured sparsity in deep neural networks. In NIPS, 2016.
  • [Wen et al., 2017] Wei Wen, Cong Xu, Chunpeng Wu, Yandan Wang, Yiran Chen, and Hai Li. Coordinating filters for faster deep neural networks. In ICCV, 2017.
  • [Xu et al., 2018] Yuhui Xu, Yongzhuang Wang, Aojun Zhou, Weiyao Lin, and Hongkai Xiong. Deep neural network compression with single and multiple level quantization. CoRR, abs/1803.03289, 2018.
  • [Xu et al., 2019] Yuhui Xu, Yuxi Li, Shuai Zhang, Wei Wen, Botao Wang, Yingyong Qi, Yiran Chen, Weiyao Lin, and Hongkai Xiong. Trained rank pruning for efficient deep neural networks. In NIPS EMC2 workshop, 2019.
  • [Zhang et al., 2016] Xiangyu Zhang, Jianhua Zou, Kaiming He, and Jian Sun. Accelerating very deep convolutional networks for classification and detection. TPAMI, 38(10):1943–1955, 2016.
  • [Zhou et al., 2017] Aojun Zhou, Anbang Yao, Yiwen Guo, Lin Xu, and Yurong Chen. Incremental network quantization: Towards lossless cnns with low-precision weights. arXiv preprint arXiv:1702.03044, 2017.
  • [Zhou et al., 2019] Yuefu Zhou, Ya Zhang, Yanfeng Wang, and Qi Tian. Accelerate cnn via recursive bayesian pruning. In ICCV, 2019.