SpineNet: Learning Scale-Permuted Backbone for Recognition and Localization

CVPR, pp. 11589-11598, 2020.


Abstract:

Convolutional neural networks typically encode an input image into a series of intermediate features with decreasing resolutions. While this structure is suited to classification tasks, it does not perform well for tasks requiring simultaneous recognition and localization (e.g., object detection). Encoder-decoder architectures are proposed to resolve this by applying a decoder network onto a backbone model designed for classification tasks. We argue that the encoder-decoder architecture is ineffective in generating strong multi-scale features because of the scale-decreased backbone, and propose SpineNet, a backbone with scale-permuted intermediate features and cross-scale connections that is learned on an object detection task by Neural Architecture Search.

Introduction
  • In the past few years, the authors have witnessed remarkable progress in deep convolutional neural network design.
  • Despite networks becoming more powerful through increased depth and width [10, 43], the meta-architecture design has not changed since the invention of convolutional neural networks.
  • Most improvements in network architecture design add network depth and connections within feature-resolution groups [19, 10, 14, 45].
Highlights
  • In the past few years, we have witnessed remarkable progress in deep convolutional neural network design
  • Despite networks becoming more powerful through increased depth and width [10, 43], the meta-architecture design has not changed since the invention of convolutional neural networks
  • RetinaNet: We evaluate SpineNet architectures on the COCO bounding box detection task with a RetinaNet detector
  • We show performance comparisons of SpineNet, ResNet-FPN, and Neural Architecture Search-FPN adopting training protocol A and B in Table 4
  • Note that SpineNet is learned on box detection with RetinaNet but works well with Mask R-CNN
  • We identify that the conventional scale-decreased model, even with a decoder network, is not effective for simultaneous recognition and localization
Methods
  • The architecture of the proposed backbone model consists of a fixed stem network followed by a learned scale-permuted network.
  • The stem network is designed with a scale-decreased architecture.
  • Blocks in the stem network can be candidate inputs for the following scale-permuted network.
  • A scale-permuted network is built from a list of building blocks {B1, B2, · · · , BN}.
  • Each block Bk has an associated feature level Li. Feature maps in an Li block have a resolution of 1/2^Li of the input resolution.
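The block/level bookkeeping above can be sketched in a few lines. The block ordering and input size below are illustrative assumptions, not the searched SpineNet architecture:

```python
def feature_resolution(input_size, level):
    """Spatial size of a level-L_i feature map: the input is downscaled by 2**L_i."""
    return input_size // (2 ** level)

# Hypothetical ordering for illustration: in a scale-permuted network the
# levels of successive blocks need not decrease monotonically.
stem_levels = [1, 2]                        # fixed scale-decreased stem
permuted_levels = [2, 4, 3, 5, 2, 6, 4, 3]  # a learned (here made-up) permutation

for i, lvl in enumerate(stem_levels + permuted_levels, start=1):
    size = feature_resolution(256, lvl)
    print(f"B{i}: level L{lvl}, feature map {size}x{size}")
```

The key property this illustrates is that, unlike a scale-decreased backbone, block resolutions can go up as well as down along the network.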
Results
  • RetinaNet: The authors evaluate SpineNet architectures on the COCO bounding box detection task with a RetinaNet detector.
  • The authors show performance comparisons of SpineNet, ResNet-FPN, and NAS-FPN adopting training protocol A and B in Table 4.
  • Table 6 shows the performance comparisons on val2017 of SpineNet and other backbones (e.g., ResNet-FPN [8] and HRNet [40]).
  • Being consistent with RetinaNet results, SpineNets are able to use fewer FLOPs and parameters but achieve better AP and mask AP at various model sizes.
  • Note that SpineNet is learned on box detection with RetinaNet but works well with Mask R-CNN.
Conclusion
  • The authors identify that the conventional scale-decreased model, even with a decoder network, is not effective for simultaneous recognition and localization.
  • The authors propose the scale-permuted model, a new meta-architecture, to address this issue.
  • To prove the effectiveness of scale-permuted models, the authors learn SpineNet by Neural Architecture Search on object detection and demonstrate that it can be used directly for image classification.
  • The same SpineNet architecture achieves comparable top-1 accuracy on ImageNet with far fewer FLOPs, and a 5% top-1 accuracy improvement on the challenging iNaturalist dataset.
  • The authors hope the scale-permuted model will become the meta-architecture design of backbones across many visual tasks beyond detection and classification.
Tables
  • Table1: Number of blocks per level for stem and scale-permuted networks. The scale-permuted network is built on top of a scale-decreased stem network as shown in Figure 4. The size of the scale-decreased stem network is gradually reduced to show the effectiveness of the scale-permuted network
  • Table2: One-stage object detection results on COCO test-dev. We compare employing different backbones with RetinaNet, except YOLOv3 [30], on a single model without test-time augmentation. By default we apply protocol B with multi-scale training and ReLU activation to train SpineNet models, as described in Section 5.1. SpineNet models marked by a dagger (†) are trained with protocol C, applying stochastic depth and swish activation for a longer training schedule. Numbers for other methods are adopted from their papers. FLOPs are reported as Multi-Adds
  • Table3: Comparison between R50-FPN and scale-permuted models on COCO val2017 adopting protocol A. Performance improves as more computation is allocated to the scale-permuted network. We also show the efficiency improvement from the scale and block-type adjustments in Section 3.1
  • Table4: Performance improvements for models trained with training protocol A (APA) and B (APB) described in Section 5.1
  • Table5: Inference latency of RetinaNet with SpineNet on a V100 GPU with NVIDIA TensorRT. Latency is measured for an end-to-end object detection pipeline including pre-processing, detection generation, and post-processing (e.g., NMS)
  • Table6: Two-stage object detection and instance segmentation results. We measure the performance of SpineNets with our Mask R-CNN implementation using 1000 proposals (marked by †). The performance of the baseline Mask R-CNN is reported in [8] and its FLOPs and Params are measured in our Mask R-CNN implementation. The performance of HRNets [40] with Faster R-CNN is reported using 512 proposals in the open-sourced implementation. By default we apply protocol B with large-scale jittering and ReLU activation to train SpineNet models, as described in Section 5.1. All results are on COCO val2017 using a single model without test-time augmentation
  • Table7: Image classification results on ImageNet and iNaturalist. Networks are sorted by increasing number of FLOPs. Note that the penultimate layer in ResNet outputs a 2048-dimensional feature vector for the classifier while SpineNet’s feature vector only has 256 dimensions. Therefore, on iNaturalist, ResNet and SpineNet have around 8M and 1M more parameters respectively
  • Table8: Importance of learned scale permutation. We compare our R0-SP53 model to hourglass and fish models with fixed block orderings. All models learn the cross-scale connections by NAS
  • Table9: Importance of learned cross-scale connections. We quantify the importance of learned cross-scale connections by performing three graph damages by removing edges of: (1) shortrange connections; (2) long-range connections; (3) all connections then sequentially connecting every pair of adjacent blocks
  • Table10: Mobile-size object detection results. We report single model results without test-time augmentation on COCO test-dev
  • Table11: The performance of SpineNet classification model can be further improved with a better training protocol by 1) adding stochastic depth, 2) replacing ReLU with swish activation and 3) using label smoothing of 0.1 (marked by †)
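The "around 8M and 1M more parameters" note in the Table 7 caption follows directly from the classifier-head sizes. A quick back-of-the-envelope check (the 5,089-class count for iNaturalist is an assumption based on the 2017 dataset, and bias terms are ignored):

```python
# Extra classifier-head weights when moving from ImageNet (1,000 classes) to
# iNaturalist (assumed 5,089 classes): feature_dim * extra_classes.
def extra_head_params(feature_dim, new_classes=5089, old_classes=1000):
    return feature_dim * (new_classes - old_classes)

resnet_extra = extra_head_params(2048)   # ResNet penultimate feature: 2048-d
spinenet_extra = extra_head_params(256)  # SpineNet penultimate feature: 256-d
print(f"ResNet: +{resnet_extra / 1e6:.1f}M params")
print(f"SpineNet: +{spinenet_extra / 1e6:.1f}M params")
```

This gives roughly 8.4M extra parameters for the 2048-dimensional ResNet feature versus roughly 1.0M for SpineNet's 256-dimensional feature, consistent with the caption.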
Related work
  • 2.1. Backbone Model

    The progress of developing convolutional neural networks has mainly been demonstrated on the ImageNet classification dataset [4]. Researchers have improved models by increasing network depth [18], introducing novel network connections [10, 35, 36, 34, 14, 13], and enhancing model capacity [43, 17] and efficiency [3, 32, 12, 38]. Several works have demonstrated that using a model with higher ImageNet accuracy as the backbone model achieves higher accuracy in other visual prediction tasks [16, 21, 1].

    (Figure: (a) R50-FPN @ 37.8% AP; (b) R23-SP30 @ 39.6% AP; (c) R0-SP53 @ 40.7% AP; (d) SpineNet-49 @ 40.8% AP)

    However, the backbones developed for ImageNet may not be effective for localization tasks, even combined with a decoder network such as [21, 1]. DetNet [20] argues that down-sampling features compromises its localization capability. HRNet [40] attempts to address the problem by adding parallel multi-scale inter-connected branches. Stacked Hourglass [27] and FishNet [33] propose recurrent down-sample and up-sample architecture with skip connections. Unlike backbones developed for ImageNet, which are mostly scale-decreased, several works above have considered backbones built with both down-sample and up-sample operations. In Section 5.5 we compare the scale-permuted model with Hourglass and Fish shape architectures.
Funding
  • Argues that the encoder-decoder architecture is ineffective in generating strong multi-scale features because of the scale-decreased backbone
  • Proposes SpineNet, a backbone with scale-permuted intermediate features and cross-scale connections that is learned on an object detection task by Neural Architecture Search
  • Proposes a meta-architecture, called the scale-permuted model, with two major improvements on backbone architecture design
  • Evaluates SpineNet on the ImageNet and iNaturalist classification datasets
References
  • Liang-Chieh Chen, Yukun Zhu, George Papandreou, Florian Schroff, and Hartwig Adam. Encoder-decoder with atrous separable convolution for semantic image segmentation. In ECCV, 2018.
  • Yukang Chen, Tong Yang, Xiangyu Zhang, Gaofeng Meng, Xinyu Xiao, and Jian Sun. DetNAS: Backbone search for object detection. In Advances in Neural Information Processing Systems, 2019.
  • François Chollet. Xception: Deep learning with depthwise separable convolutions. In CVPR, 2017.
  • Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. ImageNet: A large-scale hierarchical image database. In CVPR, 2009.
  • Golnaz Ghiasi, Tsung-Yi Lin, and Quoc V. Le. DropBlock: A regularization method for convolutional networks. In Advances in Neural Information Processing Systems, 2018.
  • Golnaz Ghiasi, Tsung-Yi Lin, and Quoc V. Le. NAS-FPN: Learning scalable feature pyramid architecture for object detection. In CVPR, 2019.
  • Priya Goyal, Piotr Dollár, Ross Girshick, Pieter Noordhuis, Lukasz Wesolowski, Aapo Kyrola, Andrew Tulloch, Yangqing Jia, and Kaiming He. Accurate, large minibatch SGD: Training ImageNet in 1 hour. arXiv preprint arXiv:1706.02677, 2017.
  • Kaiming He, Ross Girshick, and Piotr Dollár. Rethinking ImageNet pre-training. In ICCV, 2019.
  • Kaiming He, Georgia Gkioxari, Piotr Dollár, and Ross Girshick. Mask R-CNN. In ICCV, 2017.
  • Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In CVPR, 2016.
  • Tong He, Zhi Zhang, Hang Zhang, Zhongyue Zhang, Junyuan Xie, and Mu Li. Bag of tricks for image classification with convolutional neural networks. In CVPR, 2019.
  • Andrew Howard, Mark Sandler, Grace Chu, Liang-Chieh Chen, Bo Chen, Mingxing Tan, Weijun Wang, Yukun Zhu, Ruoming Pang, Vijay Vasudevan, et al. Searching for MobileNetV3. In ICCV, 2019.
  • Jie Hu, Li Shen, and Gang Sun. Squeeze-and-excitation networks. In CVPR, 2018.
  • Gao Huang, Zhuang Liu, Laurens van der Maaten, and Kilian Q. Weinberger. Densely connected convolutional networks. In CVPR, 2017.
  • Gao Huang, Yu Sun, Zhuang Liu, Daniel Sedra, and Kilian Q. Weinberger. Deep networks with stochastic depth. In ECCV, 2016.
  • Jonathan Huang, Vivek Rathod, Chen Sun, Menglong Zhu, Anoop Korattikara, Alireza Fathi, Ian Fischer, Zbigniew Wojna, Yang Song, Sergio Guadarrama, et al. Speed/accuracy trade-offs for modern convolutional object detectors. In CVPR, 2017.
  • Yanping Huang, Yonglong Cheng, Dehao Chen, HyoukJoong Lee, Jiquan Ngiam, Quoc V. Le, and Zhifeng Chen. GPipe: Efficient training of giant neural networks using pipeline parallelism. arXiv preprint arXiv:1811.06965, 2018.
  • Alex Krizhevsky, Ilya Sutskever, and Geoffrey E. Hinton. ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems, 2012.
  • Yann LeCun, Bernhard Boser, John S. Denker, Donnie Henderson, Richard E. Howard, Wayne Hubbard, and Lawrence D. Jackel. Backpropagation applied to handwritten zip code recognition. Neural Computation, 1989.
  • Zeming Li, Chao Peng, Gang Yu, Xiangyu Zhang, Yangdong Deng, and Jian Sun. DetNet: Design backbone for object detection. In ECCV, 2018.
  • Tsung-Yi Lin, Piotr Dollár, Ross Girshick, Kaiming He, Bharath Hariharan, and Serge Belongie. Feature pyramid networks for object detection. In CVPR, 2017.
  • Tsung-Yi Lin, Priya Goyal, Ross Girshick, Kaiming He, and Piotr Dollár. Focal loss for dense object detection. In ICCV, 2017.
  • Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C. Lawrence Zitnick. Microsoft COCO: Common objects in context. In ECCV, 2014.
  • Chenxi Liu, Liang-Chieh Chen, Florian Schroff, Hartwig Adam, Wei Hua, Alan L. Yuille, and Li Fei-Fei. Auto-DeepLab: Hierarchical neural architecture search for semantic image segmentation. In CVPR, 2019.
  • Chenxi Liu, Barret Zoph, Maxim Neumann, Jonathon Shlens, Wei Hua, Li-Jia Li, Li Fei-Fei, Alan Yuille, Jonathan Huang, and Kevin Murphy. Progressive neural architecture search. In ECCV, 2018.
  • Hanxiao Liu, Karen Simonyan, and Yiming Yang. DARTS: Differentiable architecture search. In ICLR, 2019.
  • Alejandro Newell, Kaiyu Yang, and Jia Deng. Stacked hourglass networks for human pose estimation. In ECCV, 2016.
  • Prajit Ramachandran, Barret Zoph, and Quoc V. Le. Searching for activation functions. arXiv preprint arXiv:1710.05941, 2017.
  • Esteban Real, Alok Aggarwal, Yanping Huang, and Quoc V. Le. Regularized evolution for image classifier architecture search. In AAAI, 2019.
  • Joseph Redmon and Ali Farhadi. YOLOv3: An incremental improvement. arXiv preprint arXiv:1804.02767, 2018.
  • Olga Russakovsky, Jia Deng, Hao Su, Jonathan Krause, Sanjeev Satheesh, Sean Ma, Zhiheng Huang, Andrej Karpathy, Aditya Khosla, Michael Bernstein, et al. ImageNet large scale visual recognition challenge. IJCV, 2015.
  • Mark Sandler, Andrew Howard, Menglong Zhu, Andrey Zhmoginov, and Liang-Chieh Chen. MobileNetV2: Inverted residuals and linear bottlenecks. In CVPR, 2018.
  • Shuyang Sun, Jiangmiao Pang, Jianping Shi, Shuai Yi, and Wanli Ouyang. FishNet: A versatile backbone for image, region, and pixel level prediction. In Advances in Neural Information Processing Systems, 2018.
  • Christian Szegedy, Sergey Ioffe, Vincent Vanhoucke, and Alexander A. Alemi. Inception-v4, Inception-ResNet and the impact of residual connections on learning. In AAAI, 2017.
  • Christian Szegedy, Wei Liu, Yangqing Jia, Pierre Sermanet, Scott Reed, Dragomir Anguelov, Dumitru Erhan, Vincent Vanhoucke, and Andrew Rabinovich. Going deeper with convolutions. In CVPR, 2015.
  • Christian Szegedy, Vincent Vanhoucke, Sergey Ioffe, Jon Shlens, and Zbigniew Wojna. Rethinking the Inception architecture for computer vision. In CVPR, 2016.
  • Mingxing Tan, Bo Chen, Ruoming Pang, Vijay Vasudevan, Mark Sandler, Andrew Howard, and Quoc V. Le. MnasNet: Platform-aware neural architecture search for mobile. In CVPR, 2019.
  • Mingxing Tan and Quoc V. Le. EfficientNet: Rethinking model scaling for convolutional neural networks. In ICML, 2019.
  • Grant Van Horn, Oisin Mac Aodha, Yang Song, Yin Cui, Chen Sun, Alex Shepard, Hartwig Adam, Pietro Perona, and Serge Belongie. The iNaturalist species classification and detection dataset. In CVPR, 2018.
  • Jingdong Wang, Ke Sun, Tianheng Cheng, Borui Jiang, Chaorui Deng, Yang Zhao, Dong Liu, Yadong Mu, Mingkui Tan, Xinggang Wang, et al. Deep high-resolution representation learning for visual recognition. PAMI, 2020.
  • Saining Xie, Alexander Kirillov, Ross Girshick, and Kaiming He. Exploring randomly wired neural networks for image recognition. In ICCV, 2019.
  • Hang Xu, Lewei Yao, Wei Zhang, Xiaodan Liang, and Zhenguo Li. Auto-FPN: Automatic network architecture adaptation for object detection beyond classification. In ICCV, 2019.
  • Sergey Zagoruyko and Nikos Komodakis. Wide residual networks. In BMVC, 2016.
  • Barret Zoph and Quoc V. Le. Neural architecture search with reinforcement learning. In ICLR, 2017.
  • Barret Zoph, Vijay Vasudevan, Jonathon Shlens, and Quoc V. Le. Learning transferable architectures for scalable image recognition. In CVPR, 2018.