DetNAS: Backbone Search for Object Detection

Advances in Neural Information Processing Systems 32 (NeurIPS 2019), pp. 6638–6648, 2019.

Keywords: image classification, object detection

Abstract:

Object detectors are usually equipped with backbone networks designed for image classification. It might be sub-optimal because of the gap between the tasks of image classification and object detection. In this work, we present DetNAS to use Neural Architecture Search (NAS) for the design of better backbones for object detection. It is no...

Introduction
  • The performance of object detectors highly relies on features extracted by backbones.
  • Many object detectors directly use networks designed for image classification as backbones.
  • This might be sub-optimal because image classification focuses on what the main object of an image is, while object detection aims at finding where each object instance is and what it is.
  • The handcrafting process heavily relies on expert knowledge and tedious trials.
Highlights
  • Backbones play an important role in object detectors
  • Our goal is to extend Neural Architecture Search (NAS) to search for backbones in object detectors
  • As in Fig. 1, DetNAS consists of 3 steps: supernet pre-training on ImageNet, supernet fine-tuning on detection datasets and architecture search on the trained supernet
  • We search on the feature pyramid network (FPN) because it is a mainstream two-stage detector that has been used in other vision tasks, e.g., instance segmentation and skeleton detection [9]
  • We present DetNAS, the first attempt to search backbones in object detectors without any proxy
  • Backbones searched by DetNAS are consistently better than the network searched on ImageNet classification, by more than 3% on VOC and 1% on COCO, whether on FPN or RetinaNet
  • Our method consists of three steps: supernet pre-training on ImageNet, supernet fine-tuning on detection datasets, and searching on the trained supernet with an evolutionary algorithm (EA)
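The three-step pipeline ends with an evolutionary search over single paths of the trained supernet. The following is a minimal, self-contained sketch of that last step, assuming a toy search space; the layer/choice counts and the `evaluate_path` fitness are stand-ins (the real DetNAS evaluates each sampled path's detection mAP with the supernet's shared weights), and all function names here are hypothetical, not from the paper's code.

```python
import random

# Hypothetical search space: 20 layers, 4 candidate blocks per layer
# (DetNAS-style spaces use ShuffleNetv2-like block choices).
NUM_LAYERS = 20
NUM_CHOICES = 4

def sample_path():
    """Uniformly sample one single-path architecture from the supernet."""
    return [random.randrange(NUM_CHOICES) for _ in range(NUM_LAYERS)]

def evaluate_path(path):
    """Toy stand-in fitness. In DetNAS this would be the detection mAP of
    the path, evaluated using the trained supernet's shared weights."""
    return sum(path) / (NUM_LAYERS * (NUM_CHOICES - 1))

def mutate(path, prob=0.1):
    """Randomly re-draw each layer's block choice with probability `prob`."""
    return [random.randrange(NUM_CHOICES) if random.random() < prob else c
            for c in path]

def crossover(a, b):
    """Pick each layer's choice from one of the two parent paths."""
    return [random.choice(pair) for pair in zip(a, b)]

def evolutionary_search(population_size=50, generations=20, topk=10):
    """EA over paths: keep the top-k each generation, refill the
    population with mutated and crossed-over children."""
    population = [sample_path() for _ in range(population_size)]
    for _ in range(generations):
        parents = sorted(population, key=evaluate_path, reverse=True)[:topk]
        children = []
        while len(children) < population_size - topk:
            if random.random() < 0.5:
                children.append(mutate(random.choice(parents)))
            else:
                children.append(crossover(*random.sample(parents, 2)))
        population = parents + children
    return max(population, key=evaluate_path)

best = evolutionary_search()
```

Because the supernet's weights are shared across paths, each fitness evaluation is only an inference pass, which is what keeps the search cost low relative to training architectures from scratch.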
Results
  • DetNASNet is searched on FPN in the large search space.
  • The architecture of DetNASNet is depicted in the supplementary material.
  • The authors search on FPN because it is a mainstream two-stage detector that has been used in other vision tasks, e.g., instance segmentation and skeleton detection [9].
  • Table 2 shows the main results.
  • The authors list three hand-crafted networks for comparisons, including ResNet-50, ResNet-101 and ShuffleNetv2-40.
  • DetNASNet achieves 40.2% mmAP with only 1.3G FLOPs. It is superior to ResNet-50 and on par with ResNet-101
Conclusion
  • The authors present DetNAS, the first attempt to search backbones in object detectors without any proxy.
  • The computation cost of DetNAS, 44 GPU days on COCO, is just two times that of training a common object detector.
  • The authors test DetNAS on various object detectors (FPN and RetinaNet) and different datasets (COCO and VOC).
  • ClsNASNet and DetNAS have different and meaningful architecture-level patterns.
  • This might, in turn, provide some insights for hand-crafted architecture design
Tables
  • Table1: Search space of DetNAS
  • Table2: Main result comparisons
  • Table3: Ablation studies
  • Table4: Computation cost for each step on COCO
  • Table5: Comparisons to the random baseline
Related work
  • 2.1 Object Detection

    Object detection aims to locate each object instance in an image and assign a class to it. With the rapid progress of deep convolutional networks, object detectors, such as FPN [14] and RetinaNet [15], have achieved great improvements in accuracy. In general, an object detector can be divided into two parts: a backbone network and a "head". In the past few years, many advances in object detection have come from the study of the "head", such as its architecture [14], loss [15, 24], and anchors [29, 26]. FPN [14] develops a top-down architecture with lateral connections to integrate features at all scales into an effective feature extractor. The focal loss [15] is proposed in RetinaNet to address the class imbalance that causes instability in early training. MetaAnchor [29] proposes a dynamic anchor mechanism to boost the performance of anchor-based object detectors. However, for the backbone network, almost all object detectors adopt networks designed for image classification, which might be sub-optimal: object detection cares not only about "what" an object is, which is the sole focus of image classification, but also "where" it is. Similar to our work, DetNet [12] also explores a backbone architecture specially designed for object detection, but designs it manually. Inspired by NAS, we present DetNAS to find an optimal backbone for object detection automatically.
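The top-down pathway with lateral connections described above can be sketched in a few lines. This is a deliberately toy 1-D sketch, not FPN's actual implementation: feature maps are plain lists of floats, the lateral "1x1 conv" is a fixed scaling, and upsampling is nearest-neighbour repetition; all function names are illustrative.

```python
# Toy 1-D sketch of FPN's top-down pathway with lateral connections.

def upsample2x(feat):
    """Nearest-neighbour 2x upsampling: repeat each value twice."""
    return [v for v in feat for _ in range(2)]

def lateral(feat, weight=0.5):
    """Stand-in for the 1x1 conv that projects a backbone feature
    to the pyramid's channel dimension."""
    return [weight * v for v in feat]

def top_down_fpn(backbone_feats):
    """backbone_feats: bottom-up features C2..C5, finest first.
    Returns the merged pyramid features P2..P5, finest first.
    Each level sums its lateral projection with the upsampled
    coarser level, so semantics flow top-down across scales."""
    coarsest_first = backbone_feats[::-1]
    merged = [lateral(coarsest_first[0])]          # P5 from C5 alone
    for c in coarsest_first[1:]:                   # C4, C3, C2
        up = upsample2x(merged[-1])
        merged.append([a + b for a, b in zip(lateral(c), up)])
    return merged[::-1]

# Spatial size doubles at each finer level: C5 has 2 positions, C2 has 16.
c2, c3, c4, c5 = [1.0] * 16, [1.0] * 8, [1.0] * 4, [1.0] * 2
p2, p3, p4, p5 = top_down_fpn([c2, c3, c4, c5])
```

The key property the sketch preserves is that every pyramid level mixes a same-resolution lateral feature with an upsampled coarser feature, which is what lets the finest levels see high-level semantics.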
Funding
  • This work is supported by Major Project for New Generation of AI Grant (No. 2018AAA0100402), National Key R&D Program of China (No. 2017YFA0700800), and the National Natural Science Foundation of China under Grants 61976208, 91646207, 61573352, and 61773377.
  • This work is also supported by Beijing Academy of Artificial Intelligence (BAAI).
References
  • Gabriel Bender, Pieter-Jan Kindermans, Barret Zoph, Vijay Vasudevan, and Quoc V. Le. Understanding and simplifying one-shot architecture search. In ICML, pages 549–558, 2018.
  • Andrew Brock, Theodore Lim, James M. Ritchie, and Nick Weston. SMASH: One-shot model architecture search through hypernetworks. CoRR, abs/1708.05344, 2017.
  • Han Cai, Ligeng Zhu, and Song Han. ProxylessNAS: Direct neural architecture search on target task and hardware. In ICLR, 2019.
  • Jianlong Chang, Xinbang Zhang, Yiwen Guo, Gaofeng Meng, Shiming Xiang, and Chunhong Pan. Differentiable architecture search with ensemble gumbel-softmax. CoRR, abs/1905.01786, 2019.
  • Golnaz Ghiasi, Tsung-Yi Lin, Ruoming Pang, and Quoc V. Le. NAS-FPN: Learning scalable feature pyramid architecture for object detection. CoRR, abs/1904.07392, 2019.
  • Ross Girshick, Ilija Radosavovic, Georgia Gkioxari, Piotr Dollár, and Kaiming He. Detectron, 2018.
  • Zichao Guo, Xiangyu Zhang, Haoyuan Mu, Wen Heng, Zechun Liu, Yichen Wei, and Jian Sun. Single path one-shot neural architecture search with uniform sampling. CoRR, abs/1904.00420, 2019.
  • Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In CVPR, pages 770–778, 2016.
  • Kaiming He, Georgia Gkioxari, Piotr Dollár, and Ross Girshick. Mask R-CNN. In ICCV, pages 2961–2969, 2017.
  • Kaiming He, Ross Girshick, and Piotr Dollár. Rethinking ImageNet pre-training. CoRR, abs/1811.08883, 2019.
  • Liam Li and Ameet Talwalkar. Random search and reproducibility for neural architecture search. CoRR, abs/1902.07638, 2019.
  • Zeming Li, Chao Peng, Gang Yu, Xiangyu Zhang, Yangdong Deng, and Jian Sun. DetNet: Design backbone for object detection. In ECCV, pages 339–354, 2018.
  • Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C. Lawrence Zitnick. Microsoft COCO: Common objects in context. In ECCV, pages 740–755, 2014.
  • Tsung-Yi Lin, Piotr Dollár, Ross B. Girshick, Kaiming He, Bharath Hariharan, and Serge J. Belongie. Feature pyramid networks for object detection. In CVPR, pages 936–944, 2017.
  • Tsung-Yi Lin, Priya Goyal, Ross B. Girshick, Kaiming He, and Piotr Dollár. Focal loss for dense object detection. In ICCV, pages 2999–3007, 2017.
  • Chenxi Liu, Liang-Chieh Chen, Florian Schroff, Hartwig Adam, Wei Hua, Alan L. Yuille, and Li Fei-Fei. Auto-DeepLab: Hierarchical neural architecture search for semantic image segmentation. CoRR, abs/1901.02985, 2019.
  • Hanxiao Liu, Karen Simonyan, and Yiming Yang. DARTS: Differentiable architecture search. In ICLR, 2019.
  • Ningning Ma, Xiangyu Zhang, Hai-Tao Zheng, and Jian Sun. ShuffleNet V2: Practical guidelines for efficient CNN architecture design. In ECCV, 2018.
  • Vladimir Nekrasov, Hao Chen, Chunhua Shen, and Ian D. Reid. Fast neural architecture search of compact semantic segmentation models via auxiliary cells. CoRR, abs/1810.10804, 2018.
  • Chao Peng, Tete Xiao, Zeming Li, Yuning Jiang, Xiangyu Zhang, Kai Jia, Gang Yu, and Jian Sun. MegDet: A large mini-batch object detector. In CVPR, pages 6181–6189, 2018.
  • Hieu Pham, Melody Y. Guan, Barret Zoph, Quoc V. Le, and Jeff Dean. Efficient neural architecture search via parameter sharing. In ICML, pages 4092–4101, 2018.
  • Zheng Qin, Zeming Li, Zhaoning Zhang, Yiping Bao, Gang Yu, Yuxing Peng, and Jian Sun. ThunderNet: Towards real-time generic object detection. CoRR, abs/1903.11752, 2019.
  • Esteban Real, Alok Aggarwal, Yanping Huang, and Quoc V. Le. Regularized evolution for image classifier architecture search. CoRR, abs/1802.01548, 2018.
  • Hamid Rezatofighi, Nathan Tsoi, JunYoung Gwak, Amir Sadeghian, Ian Reid, and Silvio Savarese. Generalized intersection over union: A metric and a loss for bounding box regression. CoRR, abs/1902.09630, 2019.
  • Ke Sun, Bin Xiao, Dong Liu, and Jingdong Wang. Deep high-resolution representation learning for human pose estimation. CoRR, abs/1902.09212, 2019.
  • Jiaqi Wang, Kai Chen, Shuo Yang, Chen Change Loy, and Dahua Lin. Region proposal by guided anchoring. CoRR, abs/1901.03278, 2019.
  • Bichen Wu, Xiaoliang Dai, Peizhao Zhang, Yanghan Wang, Fei Sun, Yiming Wu, Yuandong Tian, Peter Vajda, Yangqing Jia, and Kurt Keutzer. FBNet: Hardware-aware efficient convnet design via differentiable neural architecture search. In CVPR, 2019.
  • Sirui Xie, Hehui Zheng, Chunxiao Liu, and Liang Lin. SNAS: Stochastic neural architecture search. In ICLR, 2019.
  • Tong Yang, Xiangyu Zhang, Zeming Li, Wenqiang Zhang, and Jian Sun. MetaAnchor: Learning to detect objects with customized anchors. In NIPS, pages 318–328, 2018.
  • Xizhou Zhu, Han Hu, Stephen Lin, and Jifeng Dai. Deformable ConvNets v2: More deformable, better results. CoRR, abs/1811.11168, 2018.
  • Barret Zoph and Quoc V. Le. Neural architecture search with reinforcement learning. CoRR, abs/1611.01578, 2016.
  • Barret Zoph, Vijay Vasudevan, Jonathon Shlens, and Quoc V. Le. Learning transferable architectures for scalable image recognition. In CVPR, pages 8697–8710, 2018.