Dynamic Convolution: Attention over Convolution Kernels

CVPR, pp. 11027-11036, 2019.

Keywords:
deep neural network, mobile device, dynamic convolutional neural networks, performance degradation, architecture search

Abstract:

Light-weight convolutional neural networks (CNNs) suffer performance degradation as their low computational budgets constrain both the depth (number of convolution layers) and width (number of channels) of CNNs, resulting in limited representation capability. To address this issue, we present dynamic convolution, a new design that increases model complexity without increasing the network depth or width.

Introduction
  • Interest in building light-weight and efficient neural networks has exploded recently.
  • When the computational cost of MobileNetV3 is reduced from 219M to 66M Multi-Adds, its top-1 accuracy on ImageNet classification drops from 75.2% to 67.4%.
  • This is because the extremely low computational cost severely constrains both the network depth and width, which are crucial for performance but proportional to the computational cost.
Highlights
  • Interest in building light-weight and efficient neural networks has exploded recently
  • We describe dynamic convolutional neural networks (DY-CNNs)
  • K = 4 kernels are used in each dynamic convolution layer, and temperature annealing is used during training
  • Although we focus on efficient convolutional neural networks, we also evaluate dynamic convolution on two shallow ResNets (ResNet-10 and ResNet-18)
  • We introduce dynamic convolution, which aggregates multiple convolution kernels dynamically based upon their attentions for each input (a minimal sketch of such a layer follows this list)
  • We hope dynamic convolution becomes a useful component for efficient network architectures
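The layer described in these highlights can be summarized as: compute an input-dependent attention πk(x) over K kernels with a temperature-scaled softmax, aggregate the kernels, then apply a single convolution. Below is a minimal PyTorch-style sketch of such a layer; the class name DynamicConv2d, the two-layer attention branch, the squeeze ratio, and the grouped-convolution trick for per-sample kernels are illustrative choices, not the authors' reference implementation.

import torch
import torch.nn as nn
import torch.nn.functional as F

class DynamicConv2d(nn.Module):
    """Aggregates K convolution kernels with an input-dependent attention pi_k(x)."""
    def __init__(self, in_ch, out_ch, kernel_size, K=4, stride=1, padding=0, temperature=30.0):
        super().__init__()
        self.stride, self.padding = stride, padding
        self.temperature = temperature  # annealed towards 1.0 during training
        # K parallel kernels and biases (the kernel space of the layer).
        self.weight = nn.Parameter(0.01 * torch.randn(K, out_ch, in_ch, kernel_size, kernel_size))
        self.bias = nn.Parameter(torch.zeros(K, out_ch))
        # Attention branch: global average pool -> two FC layers -> K logits.
        hidden = max(in_ch // 4, 4)
        self.attn = nn.Sequential(
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(in_ch, hidden), nn.ReLU(inplace=True),
            nn.Linear(hidden, K),
        )

    def forward(self, x):
        B, C, H, W = x.shape
        # pi_k(x): softmax with temperature, so the K attentions sum to 1 per sample.
        pi = F.softmax(self.attn(x) / self.temperature, dim=1)          # (B, K)
        K, O, Cin, kh, kw = self.weight.shape
        # Aggregate: W_tilde(x) = sum_k pi_k(x) W_k, b_tilde(x) = sum_k pi_k(x) b_k.
        w = torch.einsum('bk,koihw->boihw', pi, self.weight)             # (B, O, Cin, kh, kw)
        b = torch.einsum('bk,ko->bo', pi, self.bias)                     # (B, O)
        # Apply a different aggregated kernel to each sample via a grouped convolution.
        out = F.conv2d(x.reshape(1, B * C, H, W),
                       w.reshape(B * O, Cin, kh, kw), b.reshape(B * O),
                       stride=self.stride, padding=self.padding, groups=B)
        return out.reshape(B, O, out.shape[-2], out.shape[-1])

For example, DynamicConv2d(16, 32, kernel_size=3, K=4, padding=1) maps an (8, 16, 56, 56) batch to (8, 32, 56, 56); aggregating kernels before a single convolution keeps the extra cost close to one convolution plus the small attention branch.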
Methods
  • Compared with the concurrent work CondConv [37] (Table 2), DY-CNNs use fewer Mult-Adds (312.9M vs. 329M on MobileNetV2 ×1.0, and 101.4M vs. 113M on ×0.5), while CondConv has a significantly larger kernel space per layer.
  • As the kernel space grows, learning the attention model πk(x) becomes more difficult.
  • The authors' method has fewer kernels per layer, a smaller model size, and less computation, yet achieves higher accuracy.
Results
  • Table 8 shows the comparison between dynamic convolution and its static counterpart in three CNN architectures (MobileNetV2, MobileNetV3 and ResNet).
  • The authors compare dynamic convolution with its static counterpart in the backbone (Type-A).
  • Dynamic convolution gains 1.6, 2.9, 2.2 AP for ResNet18, MobileNetV2 and MobileNetV3-Small, respectively.
  • Similar to Type-A, dynamic convolution outperforms its static counterpart by a clear margin.
  • It gains 3.6 and 2.9 AP for MobileNetV2 and MobileNetV3-Small, respectively.
Conclusion
  • The authors introduce dynamic convolution, which aggregates multiple convolution kernels dynamically based upon their attentions for each input.
  • Compared to its static counterpart, it significantly improves the representation capability with negligible extra computation cost and is therefore well suited to efficient CNNs. The authors' dynamic convolution can be easily integrated into existing CNN architectures.
  • By replacing each convolution kernel in MobileNet (V2 and V3) with dynamic convolution, the authors achieve solid improvement for both image classification and human pose estimation.
  • The authors hope dynamic convolution becomes a useful component for efficient network architectures.
Tables
  • Table1: Mult-Adds of static convolution and dynamic convolution in MobileNetV2 with four different width multipliers (×1.0, ×0.75, ×0.5, and ×0.35)
  • Table2: Comparison between DY-CNNs and the concurrent work (CondConv [37]) on ImageNet classification using MobileNetV2 ×1.0 and ×0.5
  • Table3: Inspecting DY-CNN using different kernel aggregations. DY-MobileNetV2 ×0.5 is used. The proper aggregation of convolution kernels {Wk} using attention πk(x) is shown in the first line. Shuffle per image means shuffling the attention weights over different kernels for the same image. Shuffle across images means using the attention of one image x for another image x′. The poor performance of the bottom four aggregations validates that DY-CNN is dynamic
  • Table4: Inspecting DY-CNN by enabling/disabling attention at different input resolutions. DY-MobileNetV2 ×0.5 is used
  • Table5: Dynamic convolution at different layers in MobileNetV2 ×0.5. C1, C2 and C3 indicate the 1 × 1 convolution that expands output channels, the 3 × 3 depthwise convolution and the 1 × 1 convolution that shrinks output channels per block respectively. C1=1 indicates using static convolution, while C1=4 indicates using dynamic convolution with 4 kernels. The numbers in brackets denote the improvement over the baseline
  • Table6: Softmax temperature: a large temperature in early training epochs is important. Temperature annealing refers to reducing τ from 30 to 1 linearly over the first 10 epochs (a small schedule sketch follows this table list). The numbers in brackets denote the performance improvement over the baseline
  • Table7: Dynamic convolution vs Squeeze-and-Excitation (SE [13]) on MobileNetV3-Small. The numbers in brackets denote the performance improvement over the baseline. Compared with static convolution with SE, dynamic convolution without SE gains 2.2% top-1 accuracy
  • Table8: ImageNet [4] classification results of DY-CNNs. The numbers in brackets denote the performance improvement over the baseline
  • Table9: Light-weight head structures for keypoint detection. We use MobileNetV2’s bottleneck residual block [27] (denoted as bneck). Each row corresponds to a stage, which starts with a bilinear upsampling operator to scale up the feature map by 2. #out denotes the number of output channels, and n denotes the number of bottleneck residual blocks
  • Table10: Keypoint detection results on COCO validation set. All models are trained from scratch. The top half uses dynamic convolution in the backbone and deconvolution in the head (Type A). The bottom half uses MobileNetV2’s bottleneck residual blocks in the head and dynamic convolution in both the backbone and the head (Type B). Each dynamic convolution layer includes K = 4 kernels. The numbers in brackets denote the performance improvement over the baseline
  • Table11: Keypoint detection results of using dynamic convolution in backbone and head separately. We use MobileNetV2 ×0.5 as backbone and use the light-weight head structure discussed in Table 9. The numbers in brackets denote the performance improvement over the baseline. Dynamic convolution can improve AP at both the backbone and the head
  • Table12: Inference running time of DY-MobileNetV2 [27] on ImageNet [4] classification. We use dynamic convolution with K = 4 kernels for all convolution layers in DY-MobileNetV2 except the first layer. CPU: CPU time in milliseconds measured on a single core of Intel Xeon CPU E5-2650 v3 (2.30GHz). The running time is calculated by averaging the inference time of 5,000 images with batch size 1. The numbers in brackets denote the performance improvement over the baseline
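Table 6 describes the temperature annealing concretely enough to sketch in code. Below is a small Python sketch, assuming the linear schedule stated in the caption (τ goes from 30 to 1 over the first 10 epochs and then stays at 1); the function name and defaults are illustrative, not taken from the authors' code.

def softmax_temperature(epoch: int, tau_start: float = 30.0, tau_end: float = 1.0,
                        anneal_epochs: int = 10) -> float:
    """Linearly anneal the softmax temperature tau over the first `anneal_epochs` epochs."""
    if epoch >= anneal_epochs:
        return tau_end
    return tau_start - (tau_start - tau_end) * (epoch / anneal_epochs)

# Example: epoch 0 -> 30.0, epoch 5 -> 15.5, epoch 10 and later -> 1.0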
Related work
  • Efficient CNNs: Recently, designing efficient CNN architectures [15, 12, 27, 11, 42, 23] has been an active research area. SqueezeNet [15] reduces the number of parameters by using 1 × 1 convolution extensively in the fire module. MobileNetV1 [12] substantially reduces FLOPs by decomposing a 3 × 3 convolution into a depthwise convolution and a pointwise convolution. Based upon this, MobileNetV2 [27] introduces inverted residuals and linear bottlenecks. MobileNetV3 [11] applies squeeze-and-excitation [13] in the residual layer and employs a platform-aware neural architecture search approach [29] to find the optimal network structures. ShuffleNet [42] further reduces the MAdds of 1 × 1 convolution by channel shuffle operations. ShiftNet [33] replaces expensive spatial convolution with the shift operation and pointwise convolutions. Compared with existing methods, our dynamic convolution can be used to replace any static convolution kernel (e.g. 1 × 1, 3 × 3, depthwise convolution, group convolution) and is complementary to other advanced operators like squeeze-and-excitation.
Funding
  • Presents Dynamic Convolution, a new design that increases model complexity without increasing the network depth or width
  • Proposes a new operator design, named dynamic convolution, to increase the representation capability with negligible extra FLOPs
  • Found two keys for efficient joint optimization: constraining the attention output so that Σk πk(x) = 1, which facilitates the learning of the attention model πk(x), and flattening the attention in early training epochs, which facilitates the learning of the convolution kernels {W̃k, b̃k} (written out as equations after this list)
  • Demonstrates the effectiveness of dynamic convolution on both image classification and keypoint detection
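Written out, the aggregation and the two optimization keys above take the following form. This is a reconstruction in standard notation rather than a verbatim copy of the paper's equations: g denotes the activation function, zk(x) the attention logits, and τ the softmax temperature that is kept large (i.e. the attention is flattened) in early epochs.

\[
\begin{aligned}
y &= g\!\left(\tilde{W}(x)^{\top} x + \tilde{b}(x)\right),\\
\tilde{W}(x) &= \sum_{k=1}^{K} \pi_k(x)\,\tilde{W}_k, \qquad
\tilde{b}(x) = \sum_{k=1}^{K} \pi_k(x)\,\tilde{b}_k,\\
\pi_k(x) &= \frac{\exp\!\bigl(z_k(x)/\tau\bigr)}{\sum_{j=1}^{K}\exp\!\bigl(z_j(x)/\tau\bigr)},
\qquad 0 \le \pi_k(x) \le 1, \quad \sum_{k=1}^{K}\pi_k(x) = 1.
\end{aligned}
\]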
References
  • [1] Han Cai, Chuang Gan, and Song Han. Once for all: Train one network and specialize it for efficient deployment. arXiv preprint arXiv:1908.09791, 2019.
  • [2] Han Cai, Ligeng Zhu, and Song Han. ProxylessNAS: Direct neural architecture search on target task and hardware. In International Conference on Learning Representations, 2019.
  • [3] Matthieu Courbariaux, Yoshua Bengio, and Jean-Pierre David. BinaryConnect: Training deep neural networks with binary weights during propagations. In Advances in Neural Information Processing Systems 28, pages 3123–3131. Curran Associates, Inc., 2015.
  • [4] Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. ImageNet: A large-scale hierarchical image database. In 2009 IEEE Conference on Computer Vision and Pattern Recognition, pages 248–255.
  • [5] Xiaohan Ding, Yuchen Guo, Guiguang Ding, and Jungong Han. ACNet: Strengthening the kernel skeletons for powerful CNN via asymmetric convolution blocks. In The IEEE International Conference on Computer Vision (ICCV), October 2019.
  • [6] Ian Goodfellow, Yoshua Bengio, and Aaron Courville. Deep Learning. The MIT Press, 2016.
  • [7] Zichao Guo, Xiangyu Zhang, Haoyuan Mu, Wen Heng, Zechun Liu, Yichen Wei, and Jian Sun. Single path one-shot neural architecture search with uniform sampling, 2019.
  • [8] Song Han, Huizi Mao, and William Dally. Deep compression: Compressing deep neural networks with pruning, trained quantization and Huffman coding. In International Conference on Learning Representations (ICLR), 2016.
  • [9] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 770–778, 2016.
  • [10] Yihui He, Ji Lin, Zhijian Liu, Hanrui Wang, Li-Jia Li, and Song Han. AMC: AutoML for model compression and acceleration on mobile devices. In The European Conference on Computer Vision (ECCV), September 2018.
  • [11] Andrew Howard, Mark Sandler, Grace Chu, Liang-Chieh Chen, Bo Chen, Mingxing Tan, Weijun Wang, Yukun Zhu, Ruoming Pang, Vijay Vasudevan, Quoc V. Le, and Hartwig Adam. Searching for MobileNetV3. arXiv preprint arXiv:1905.02244, 2019.
  • [12] Andrew G. Howard, Menglong Zhu, Bo Chen, Dmitry Kalenichenko, Weijun Wang, Tobias Weyand, Marco Andreetto, and Hartwig Adam. MobileNets: Efficient convolutional neural networks for mobile vision applications. arXiv preprint arXiv:1704.04861, 2017.
  • [13] Jie Hu, Li Shen, and Gang Sun. Squeeze-and-excitation networks. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2018.
  • [14] Gao Huang, Danlu Chen, Tianhong Li, Felix Wu, Laurens van der Maaten, and Kilian Weinberger. Multi-scale dense networks for resource efficient image classification. In International Conference on Learning Representations, 2018.
  • [15] Forrest N. Iandola, Matthew W. Moskewicz, Khalid Ashraf, Song Han, William J. Dally, and Kurt Keutzer. SqueezeNet: AlexNet-level accuracy with 50x fewer parameters and <1MB model size. arXiv preprint arXiv:1602.07360, 2016.
  • [16] Kevin Jarrett, Koray Kavukcuoglu, Marc’Aurelio Ranzato, and Yann LeCun. What is the best multi-stage architecture for object recognition? In The IEEE International Conference on Computer Vision (ICCV), 2009.
  • [17] Diederik P. Kingma and Jimmy Ba. Adam: A method for stochastic optimization. In International Conference on Learning Representations (ICLR), 2015.
  • [18] Ji Lin, Yongming Rao, Jiwen Lu, and Jie Zhou. Runtime neural pruning. In Advances in Neural Information Processing Systems, pages 2181–2191, 2017.
  • [19] Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C. Lawrence Zitnick. Microsoft COCO: Common objects in context. In European Conference on Computer Vision, pages 740–755.
  • [20] Hanxiao Liu, Karen Simonyan, and Yiming Yang. DARTS: Differentiable architecture search. In International Conference on Learning Representations, 2019.
  • [21] Lanlan Liu and Jia Deng. Dynamic deep neural networks: Optimizing accuracy-efficiency trade-offs by selective execution. In AAAI Conference on Artificial Intelligence (AAAI), 2018.
  • [22] Zhuang Liu, Jianguo Li, Zhiqiang Shen, Gao Huang, Shoumeng Yan, and Changshui Zhang. Learning efficient convolutional networks through network slimming. In The IEEE International Conference on Computer Vision (ICCV), October 2017.
  • [23] Ningning Ma, Xiangyu Zhang, Hai-Tao Zheng, and Jian Sun. ShuffleNet V2: Practical guidelines for efficient CNN architecture design. In The European Conference on Computer Vision (ECCV), September 2018.
  • [24] Vinod Nair and Geoffrey E. Hinton. Rectified linear units improve restricted Boltzmann machines. In ICML, 2010.
  • [25] Adam Paszke, Sam Gross, Soumith Chintala, Gregory Chanan, Edward Yang, Zachary DeVito, Zeming Lin, Alban Desmaison, Luca Antiga, and Adam Lerer. Automatic differentiation in PyTorch. In NIPS Autodiff Workshop, 2017.
  • [26] Esteban Real, Alok Aggarwal, Yanping Huang, and Quoc V. Le. Regularized evolution for image classifier architecture search. In AAAI Conference on Artificial Intelligence (AAAI), 2018.
  • [27] Mark Sandler, Andrew Howard, Menglong Zhu, Andrey Zhmoginov, and Liang-Chieh Chen. MobileNetV2: Inverted residuals and linear bottlenecks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 4510–4520, 2018.
  • [28] Ke Sun, Bin Xiao, Dong Liu, and Jingdong Wang. Deep high-resolution representation learning for human pose estimation. In CVPR, 2019.
  • [29] Mingxing Tan, Bo Chen, Ruoming Pang, Vijay Vasudevan, Mark Sandler, Andrew Howard, and Quoc V. Le. MnasNet: Platform-aware neural architecture search for mobile. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2019.
  • [30] Kuan Wang, Zhijian Liu, Yujun Lin, Ji Lin, and Song Han. HAQ: Hardware-aware automated quantization with mixed precision. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2019.
  • [31] Xin Wang, Fisher Yu, Zi-Yi Dou, Trevor Darrell, and Joseph E. Gonzalez. SkipNet: Learning dynamic routing in convolutional networks. In The European Conference on Computer Vision (ECCV), September 2018.
  • [32] Bichen Wu, Xiaoliang Dai, Peizhao Zhang, Yanghan Wang, Fei Sun, Yiming Wu, Yuandong Tian, Peter Vajda, Yangqing Jia, and Kurt Keutzer. FBNet: Hardware-aware efficient ConvNet design via differentiable neural architecture search. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2019.
  • [33] Bichen Wu, Alvin Wan, Xiangyu Yue, Peter Jin, Sicheng Zhao, Noah Golmant, Amir Gholaminejad, Joseph Gonzalez, and Kurt Keutzer. Shift: A zero FLOP, zero parameter alternative to spatial convolutions. 2017.
  • [34] Zuxuan Wu, Tushar Nagarajan, Abhishek Kumar, Steven Rennie, Larry S. Davis, Kristen Grauman, and Rogerio Feris. BlockDrop: Dynamic inference paths in residual networks. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2018.
  • [35] Bin Xiao, Haiping Wu, and Yichen Wei. Simple baselines for human pose estimation and tracking. In European Conference on Computer Vision, 2018.
  • [36] Sirui Xie, Hehui Zheng, Chunxiao Liu, and Liang Lin. SNAS: Stochastic neural architecture search. In International Conference on Learning Representations, 2019.
  • [37] Brandon Yang, Gabriel Bender, Quoc V. Le, and Jiquan Ngiam. CondConv: Conditionally parameterized convolutions for efficient inference. In NeurIPS, 2019.
  • [38] Jiwei Yang, Xu Shen, Jun Xing, Xinmei Tian, Houqiang Li, Bing Deng, Jianqiang Huang, and Xian-Sheng Hua. Quantization networks. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2019.
  • [39] Jiahui Yu, Linjie Yang, Ning Xu, Jianchao Yang, and Thomas Huang. Slimmable neural networks. In International Conference on Learning Representations, 2019.
  • [40] Dongqing Zhang, Jiaolong Yang, Dongqiangzi Ye, and Gang Hua. LQ-Nets: Learned quantization for highly accurate and compact deep neural networks. In The European Conference on Computer Vision (ECCV), September 2018.
  • [41] Hongyi Zhang, Moustapha Cisse, Yann N. Dauphin, and David Lopez-Paz. mixup: Beyond empirical risk minimization. In International Conference on Learning Representations, 2018.
  • [42] Xiangyu Zhang, Xinyu Zhou, Mengxiao Lin, and Jian Sun. ShuffleNet: An extremely efficient convolutional neural network for mobile devices. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2018.
  • [43] Chenzhuo Zhu, Song Han, Huizi Mao, and William J. Dally. Trained ternary quantization. In International Conference on Learning Representations (ICLR), 2017.
  • [44] Barret Zoph and Quoc V. Le. Neural architecture search with reinforcement learning. arXiv preprint arXiv:1611.01578, 2017.
  • [45] Barret Zoph, Vijay Vasudevan, Jonathon Shlens, and Quoc V. Le. Learning transferable architectures for scalable image recognition. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2018.