SM-NAS: Structural-to-Modular Neural Architecture Search for Object Detection

AAAI, pp. 12661-12668, 2020.

Keywords:
SM-NAS; modular level; neural architecture search; efficient combination; inference time
Weibo:
We propose a detection neural architecture search (NAS) framework that searches both an efficient combination of modules and better modular-level architectures for object detection on a target device.

Abstract:

State-of-the-art object detection methods are complicated, with various modules such as the backbone, feature fusion neck, RPN and RCNN head, where each module may have different designs and structures. How to leverage the computational cost and accuracy trade-off for the structural combination as well as the modular selection of multiple ...

Introduction
  • Real-time object detection is a core and challenging task to localize and recognize objects in an image on a certain device.
  • How to select the best combination of modules under hardware resource constraints remains unknown.
  • Empirically, the authors found that the combination of Cascade R-CNN with ResNet18 is even faster and more accurate than FPN with ResNet50 on COCO [29] and BDD [59].
  • This is not the case on VOC [12]
Highlights
  • Real-time object detection is a core and challenging task to localize and recognize objects in an image on a certain device
  • By investigating state-of-the-art designs, we found that three factors are crucial for the performance of a detection system: 1) the size of the input images; 2) the combination of modules in the detector; 3) the architecture within each module
  • To find an optimal trade-off between inference time and accuracy over these three factors, we propose a coarse-to-fine searching strategy (see the sketch after this list): 1) the structural-level searching stage (Stage-one) first aims to find an efficient combination of different modules as well as the model-matching input sizes; 2) the modular-level searching stage (Stage-two) evolves each specific module and pushes towards an efficient task-specific network
  • Our E2 reaches half the inference time of FPN with an additional 1% mAP improvement
  • As in Figure 2, we propose a coarse-to-fine searching pipeline: 1) the structural-level searching stage first aims to find an efficient combination of different modules; 2) the modular-level searching stage evolves each specific module and pushes towards a faster task-specific network
  • We propose a detection neural architecture search (NAS) framework for searching both an efficient combination of modules and better modular-level architectures for object detection on a target device
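The coarse-to-fine pipeline described above can be pictured with a short sketch. This is a minimal illustration under assumed names (`evaluate`, `mutate`, and the candidate pools are placeholders), not the authors' implementation: Stage-one enumerates module combinations and input sizes and keeps the speed/accuracy Pareto front, and Stage-two evolves the surviving candidates at the modular level.

```python
import itertools
import random

# Hypothetical candidate pools; the real structural space follows Table 2.
BACKBONES = ["ResNet18", "ResNet50", "ResNeXt50"]
NECKS     = [None, "FPN"]
HEADS     = ["2FC", "Cascade-2", "Cascade-3"]
INPUT_RES = [512, 640, 800]

def evaluate(arch):
    """Toy proxy for (inference_time_ms, mAP); a real run would train the
    detector and measure both on the target device and dataset."""
    depth = {"ResNet18": 18, "ResNet50": 50, "ResNeXt50": 50}[arch["backbone"]]
    time_ms = 0.5 * depth * (arch["input"] / 512) ** 2 + (5 if arch["neck"] else 0)
    m_ap = 30 + 0.1 * depth + (2 if arch["head"].startswith("Cascade") else 0)
    return time_ms, m_ap

def pareto_front(candidates):
    """Keep candidates not dominated in (lower time, higher mAP)."""
    scored = [(c, evaluate(c)) for c in candidates]
    front = []
    for c, (t, m) in scored:
        dominated = any(t2 <= t and m2 >= m and (t2, m2) != (t, m)
                        for _, (t2, m2) in scored)
        if not dominated:
            front.append(c)
    return front

# Stage one: structural-level search over module combinations + input sizes.
structural = [dict(backbone=b, neck=n, head=h, input=r)
              for b, n, h, r in itertools.product(BACKBONES, NECKS, HEADS, INPUT_RES)]
population = pareto_front(structural)

# Stage two: modular-level evolutionary search seeded by Stage-one winners.
def mutate(arch):
    """Toy stand-in for mutating the internals of one module
    (block type, depth, channels, ...)."""
    child = dict(arch)
    field = random.choice(["backbone", "head", "input"])
    pool = {"backbone": BACKBONES, "head": HEADS, "input": INPUT_RES}[field]
    child[field] = random.choice(pool)
    return child

for _ in range(10):  # a few evolution rounds
    children = [mutate(random.choice(population)) for _ in range(8)]
    population = pareto_front(population + children)

print(len(population), "Pareto-optimal detector configurations")
```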
Methods
  • Comparison with state-of-the-art detectors (Table 3), reporting input size, inference time, and COCO AP / AP50 / AP75 / APS / APM / APL for YOLOv3 [44] (DarkNet-53), DSSD513 [13] (ResNet101), RetinaNet [28] (ResNet101-FPN), FSAF [62], CornerNet [19] (Hourglass-104), and CenterNet [61]
Results
  • On the COCO dataset, the optimal architectures E0 to E5 are identified by the two-stage search.
  • The authors first pre-train the searched backbones on ImageNet following common practice [17] for a fair comparison with other methods.
  • Figure 8 shows the correlation coefficients between factors across all searched models on COCO for all the Pareto fronts in Stage-two.
  • The mAP is positively correlated with the depth (a toy calculation of such a coefficient follows this list).
  • From Figure 11, on the easier PASCAL VOC dataset, our E3 performs very well
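For concreteness, the correlation coefficient in Figure 8 can be read as an ordinary Pearson correlation between an architecture factor (e.g. backbone depth) and mAP across the Pareto-front models. The sketch below shows the computation on made-up placeholder numbers, not the paper's data.

```python
import numpy as np

# Hypothetical (backbone depth, mAP) pairs for Pareto-front models;
# placeholder numbers, not results from the paper.
depth = np.array([18, 26, 34, 50, 77, 101])
m_ap  = np.array([33.1, 35.0, 36.4, 38.2, 40.0, 41.5])

# Pearson correlation coefficient between depth and mAP.
r = np.corrcoef(depth, m_ap)[0, 1]
print(f"corr(depth, mAP) = {r:.2f}")  # close to +1 for this toy data
```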
Conclusion
  • The authors propose a detection NAS framework for searching both an efficient combination of modules and better modular-level architectures for object detection on a target device.
  • The searched SM-NAS networks achieve a state-of-the-art speed/accuracy trade-off.
  • The SM-NAS pipeline can keep being updated with new modules in the future
Tables
  • Table 1: Preliminary empirical experiments. Inference time is tested on one V100 GPU. The performance of a detection model is highly related to the dataset (Exp 1-4). A better combination of modules and input resolution can lead to a more efficient detection system (Exp 5 & 6)
  • Table 2: Detailed architecture of the final SM-NAS models E0 to E5. For the backbone, basicblock and bottleneck follow the same settings as in ResNet [18] and Xbottleneck refers to the block setting of ResNeXt [56]. For the neck, P2-P5 and "c" denote the choice and the channels of the output feature levels in FPN. For the RCNN head, "2FC" is the regular setting of two shared fully connected layers; "n" is the number of stages of the cascade head
  • Table 3: Comparison of mAP of state-of-the-art single models on COCO test-dev. Our searched models dominate most SOTA models in terms of speed/accuracy by a large margin
  • Table 4: FPN with ResNet-50 trained with different strategies, evaluated on COCO val. "GN" is Group Normalization [54]. "WS" is the Weight Standardization method of [38]. We found that with Group Normalization, Weight Standardization, a larger learning rate, and a larger batch size, we can train a detection network from scratch in fewer epochs than the standard training procedure (a minimal sketch of such a layer follows these table notes)
  • Table 5: Transferability of our models on PASCAL VOC (VOC) and the Berkeley Deep Drive dataset (BDD)
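The from-scratch recipe in Table 4 combines Group Normalization [54] with Weight Standardization [38]. Below is a minimal PyTorch sketch of such a layer, assuming a standard Conv-GN-ReLU block; it re-implements the published WS/GN ideas for illustration only and is not the authors' training code (the group count and epsilon are assumptions).

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class WSConv2d(nn.Conv2d):
    """Conv2d with Weight Standardization: each output filter's weights
    are normalized to zero mean and unit std before the convolution."""
    def forward(self, x):
        w = self.weight
        mean = w.mean(dim=(1, 2, 3), keepdim=True)
        std = w.std(dim=(1, 2, 3), keepdim=True) + 1e-5
        w = (w - mean) / std
        return F.conv2d(x, w, self.bias, self.stride,
                        self.padding, self.dilation, self.groups)

def ws_gn_block(in_ch, out_ch, groups=32):
    """Conv -> GroupNorm -> ReLU block as used when training detectors
    from scratch (the group count of 32 is an assumption)."""
    return nn.Sequential(
        WSConv2d(in_ch, out_ch, kernel_size=3, padding=1, bias=False),
        nn.GroupNorm(groups, out_ch),
        nn.ReLU(inplace=True),
    )

x = torch.randn(2, 64, 56, 56)
y = ws_gn_block(64, 128)(x)   # -> shape (2, 128, 56, 56)
```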
Related work
  • Object Detection. Object detection is a core problem in computer vision. State-of-the-art anchor-based detection approaches usually consist of four modules: backbone, feature fusion neck, region proposal network (in two-stage detectors), and RCNN head. Most previous progress focuses on developing better architectures for each module. For example, [24] develops a backbone tailored for detection; FPN [25] and PANet [34] modify the multi-level feature fusion module; [52] tries to make the RPN more powerful. On the other hand, R-FCN [11] and Light-Head RCNN [23] design different structures for the bbox head. However, the community lacks literature comparing the efficiency and performance of different combinations of these modules.
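This modular decomposition is also what SM-NAS searches over at the structural level (cf. Table 2). The snippet below sketches how such a search space and one sampled combination could be encoded as plain data; the field names and option lists are illustrative assumptions, not the paper's exact encoding.

```python
# One candidate detector = a choice per module plus an input resolution.
# Field names and option lists are illustrative, not the paper's exact encoding.
search_space = {
    "input_resolution": [512, 640, 800],
    "backbone": {
        "block": ["basicblock", "bottleneck", "Xbottleneck"],  # ResNet / ResNeXt style
        "stage_depths": [[2, 2, 2, 2], [3, 4, 6, 3]],
        "base_channels": [48, 64, 80],
    },
    "neck": {
        "type": [None, "FPN"],
        "levels": ["P2-P5", "P3-P5"],
        "channels": [128, 256],
    },
    "rpn": ["vanilla", "guided_anchoring"],
    "head": ["2FC", "cascade-2", "cascade-3"],
}

# A sampled architecture is then one value per field, e.g.:
candidate = {
    "input_resolution": 800,
    "backbone": {"block": "bottleneck", "stage_depths": [3, 4, 6, 3], "base_channels": 64},
    "neck": {"type": "FPN", "levels": "P2-P5", "channels": 256},
    "rpn": "vanilla",
    "head": "cascade-3",
}
```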
Reference
  • Bowen Baker, Otkrist Gupta, Nikhil Naik, and Ramesh Raskar. Designing neural network architectures using reinforcement learning. arXiv preprint arXiv:1611.02167, 2016.
  • Chandrasekhar Bhagavatula, Chenchen Zhu, Khoa Luu, and Marios Savvides. Faster than real-time facial alignment: A 3D spatial transformer network approach in unconstrained poses. In ICCV, 2017.
  • Han Cai, Tianyao Chen, Weinan Zhang, Yong Yu, and Jun Wang. Efficient architecture search by network transformation. In AAAI, 2018.
  • Han Cai, Ligeng Zhu, and Song Han. ProxylessNAS: Direct neural architecture search on target task and hardware. In ICLR, 2019.
  • Zhaowei Cai and Nuno Vasconcelos. Cascade R-CNN: Delving into high quality object detection. In CVPR, 2018.
  • Florian Chabot, Mohamed Chaouch, Jaonary Rabarisoa, Celine Teuliere, and Thierry Chateau. Deep MANTA: A coarse-to-fine many-task network for joint 2D and 3D vehicle analysis from monocular image. In CVPR, 2017.
  • Kai Chen, Jiangmiao Pang, Jiaqi Wang, Yu Xiong, Xiaoxiao Li, Shuyang Sun, Wansen Feng, Ziwei Liu, Jianping Shi, Wanli Ouyang, Chen Change Loy, and Dahua Lin. MMDetection. https://github.com/open-mmlab/mmdetection, 2018.
  • Liang-Chieh Chen, Maxwell Collins, Yukun Zhu, George Papandreou, Barret Zoph, Florian Schroff, Hartwig Adam, and Jon Shlens. Searching for efficient multi-scale architectures for dense image prediction. In NIPS, 2018.
  • Yuntao Chen, Chenxia Han, Naiyan Wang, and Zhaoxiang Zhang. Revisiting feature alignment for one-stage object detection. arXiv preprint arXiv:1908.01570, 2019.
  • Yukang Chen, Tong Yang, Xiangyu Zhang, Gaofeng Meng, Chunhong Pan, and Jian Sun. DetNAS: Neural architecture search on object detection. arXiv preprint arXiv:1903.10979, 2019.
  • Jifeng Dai, Yi Li, Kaiming He, and Jian Sun. R-FCN: Object detection via region-based fully convolutional networks. In NIPS, 2016.
  • M. Everingham, L. Van Gool, C. K. I. Williams, J. Winn, and A. Zisserman. The PASCAL visual object classes (VOC) challenge. IJCV, 88(2):303–338, 2010.
  • Cheng-Yang Fu, Wei Liu, Ananth Ranga, Ambrish Tyagi, and Alexander C. Berg. DSSD: Deconvolutional single shot detector. In ICCV, 2017.
  • Golnaz Ghiasi, Tsung-Yi Lin, and Quoc V. Le. NAS-FPN: Learning scalable feature pyramid architecture for object detection. In CVPR, 2019.
  • Kaiming He, Ross Girshick, and Piotr Dollar. Rethinking ImageNet pre-training. arXiv preprint arXiv:1811.08883, 2018.
  • Kaiming He, Georgia Gkioxari, Piotr Dollar, and Ross Girshick. Mask R-CNN. In ICCV, 2017.
  • Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In CVPR, 2016.
  • Hei Law and Jia Deng. CornerNet: Detecting objects as paired keypoints. In ECCV, 2018.
  • Xin Li, Yiming Zhou, Zheng Pan, and Jiashi Feng. Partial order pruning: For best speed/accuracy trade-off in neural architecture search. In CVPR, pages 9145–9153, 2019.
  • Yanghao Li, Yuntao Chen, Naiyan Wang, and Zhaoxiang Zhang. Scale-aware Trident networks for object detection. arXiv preprint arXiv:1901.01892, 2019.
  • Zeming Li, Chao Peng, Gang Yu, Xiangyu Zhang, Yangdong Deng, and Jian Sun. Light-Head R-CNN: In defense of two-stage object detector. In CVPR, 2017.
  • Zeming Li, Chao Peng, Gang Yu, Xiangyu Zhang, Yangdong Deng, and Jian Sun. DetNet: A backbone network for object detection. In ECCV, 2018.
  • Tsung-Yi Lin, Piotr Dollar, Ross Girshick, Kaiming He, Bharath Hariharan, and Serge Belongie. Feature pyramid networks for object detection. In CVPR, 2017.
  • Tsung-Yi Lin, Priya Goyal, Ross Girshick, Kaiming He, and Piotr Dollar. Focal loss for dense object detection. In ICCV, pages 2980–2988, 2017.
  • Tsung-Yi Lin, Priya Goyal, Ross Girshick, Kaiming He, and Piotr Dollar. Focal loss for dense object detection. TPAMI, 2018.
  • Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollar, and C. Lawrence Zitnick. Microsoft COCO: Common objects in context. In ECCV, 2014.
  • Chenxi Liu, Liang-Chieh Chen, Florian Schroff, Hartwig Adam, Wei Hua, Alan Yuille, and Li Fei-Fei. Auto-DeepLab: Hierarchical neural architecture search for semantic image segmentation. arXiv preprint arXiv:1901.02985, 2019.
  • Chenxi Liu, Barret Zoph, Maxim Neumann, Jonathon Shlens, Wei Hua, Li-Jia Li, Li Fei-Fei, Alan Yuille, Jonathan Huang, and Kevin Murphy. Progressive neural architecture search. In ECCV, 2018.
  • Hanxiao Liu, Karen Simonyan, Oriol Vinyals, Chrisantha Fernando, and Koray Kavukcuoglu. Hierarchical representations for efficient architecture search. arXiv preprint arXiv:1711.00436, 2017.
  • Hanxiao Liu, Karen Simonyan, and Yiming Yang. DARTS: Differentiable architecture search. In ICLR, 2018.
  • Shu Liu, Lu Qi, Haifang Qin, Jianping Shi, and Jiaya Jia. Path aggregation network for instance segmentation. In CVPR, 2018.
  • Wei Liu, Dragomir Anguelov, Dumitru Erhan, Christian Szegedy, Scott Reed, Cheng-Yang Fu, and Alexander C. Berg. SSD: Single shot multibox detector. In ECCV, 2016.
  • Ping Luo, Yonglong Tian, Xiaogang Wang, and Xiaoou Tang. Switchable deep network for pedestrian detection. In CVPR, 2014.
  • Adam Paszke, Sam Gross, Soumith Chintala, Gregory Chanan, Edward Yang, Zachary DeVito, Zeming Lin, Alban Desmaison, Luca Antiga, and Adam Lerer. Automatic differentiation in PyTorch. In NIPS Workshop, 2017.
  • Siyuan Qiao, Huiyu Wang, Chenxi Liu, Wei Shen, and Alan Yuille. Weight standardization. arXiv preprint arXiv:1903.10520, 2019.
  • Zheng Qin, Zeming Li, Zhaoning Zhang, Yiping Bao, Gang Yu, Yuxing Peng, and Jian Sun. ThunderNet: Towards real-time generic object detection. arXiv preprint arXiv:1903.11752, 2019.
  • Esteban Real, Alok Aggarwal, Yanping Huang, and Quoc V. Le. Regularized evolution for image classifier architecture search. arXiv preprint arXiv:1802.01548, 2018.
  • Esteban Real, Sherry Moore, Andrew Selle, Saurabh Saxena, Yutaka Leon Suematsu, Jie Tan, Quoc V. Le, and Alexey Kurakin. Large-scale evolution of image classifiers. In ICML, 2017.
  • Joseph Redmon, Santosh Divvala, Ross Girshick, and Ali Farhadi. You only look once: Unified, real-time object detection. In CVPR, 2016.
  • Joseph Redmon and Ali Farhadi. YOLO9000: Better, faster, stronger. In CVPR, 2017.
  • Joseph Redmon and Ali Farhadi. YOLOv3: An incremental improvement. arXiv preprint arXiv:1804.02767, 2018.
  • Shaoqing Ren, Kaiming He, Ross Girshick, and Jian Sun. Faster R-CNN: Towards real-time object detection with region proposal networks. In NIPS, 2015.
  • Olga Russakovsky, Jia Deng, Hao Su, Jonathan Krause, Sanjeev Satheesh, Sean Ma, Zhiheng Huang, Andrej Karpathy, Aditya Khosla, Michael Bernstein, et al. ImageNet large scale visual recognition challenge. IJCV, 115(3):211–252, 2015.
  • Mark Sandler, Andrew Howard, Menglong Zhu, Andrey Zhmoginov, and Liang-Chieh Chen. MobileNetV2: Inverted residuals and linear bottlenecks. In CVPR, pages 4510–4520, 2018.
  • Shibani Santurkar, Dimitris Tsipras, Andrew Ilyas, and Aleksander Madry. How does batch normalization help optimization? In NIPS, pages 2483–2493, 2018.
  • Zhiqiang Shen, Zhuang Liu, Jianguo Li, Yu-Gang Jiang, Yurong Chen, and Xiangyang Xue. DSOD: Learning deeply supervised object detectors from scratch. In ICCV, pages 1919–1927, 2017.
  • Mingxing Tan and Quoc V. Le. EfficientNet: Rethinking model scaling for convolutional neural networks. arXiv preprint arXiv:1905.11946, 2019.
  • Jiaqi Wang, Kai Chen, Shuo Yang, Chen Change Loy, and Dahua Lin. Region proposal by guided anchoring. In CVPR, pages 2965–2974, 2019.
  • Ning Wang, Yang Gao, Hao Chen, Peng Wang, Zhi Tian, and Chunhua Shen. NAS-FCOS: Fast neural architecture search for object detection. arXiv preprint arXiv:1906.04423, 2019.
  • Yuxin Wu and Kaiming He. Group normalization. In ECCV, pages 3–19, 2018.
  • Saining Xie, Ross Girshick, Piotr Dollar, Zhuowen Tu, and Kaiming He. Aggregated residual transformations for deep neural networks. In CVPR, pages 1492–1500, 2017.
  • Sirui Xie, Hehui Zheng, Chunxiao Liu, and Liang Lin. SNAS: Stochastic neural architecture search. In ICLR, 2019.
  • Hang Xu, Lewei Yao, Wei Zhang, Xiaodan Liang, and Zhenguo Li. Auto-FPN: Automatic network architecture adaptation for object detection beyond classification. In ICCV, 2019.
  • Fisher Yu, Wenqi Xian, Yingying Chen, Fangchen Liu, Mike Liao, Vashisht Madhavan, and Trevor Darrell. BDD100K: A diverse driving video database with scalable annotation tooling. arXiv preprint arXiv:1805.04687, 2018.
  • Zhao Zhong, Junjie Yan, Wei Wu, Jing Shao, and Cheng-Lin Liu. Practical block-wise neural network architecture generation. In CVPR, 2018.
  • Xingyi Zhou, Dequan Wang, and Philipp Krahenbuhl. Objects as points. arXiv preprint arXiv:1904.07850, 2019.
  • Chenchen Zhu, Yihui He, and Marios Savvides. Feature selective anchor-free module for single-shot object detection. arXiv preprint arXiv:1903.00621, 2019.
  • Rui Zhu, Shifeng Zhang, Xiaobo Wang, Longyin Wen, Hailin Shi, Liefeng Bo, and Tao Mei. ScratchDet: Exploring to train single-shot object detectors from scratch. arXiv preprint arXiv:1810.08425, 2018.
  • Barret Zoph, Vijay Vasudevan, Jonathon Shlens, and Quoc V. Le. Learning transferable architectures for scalable image recognition. In CVPR, 2018.