Rethinking Classification and Localization in R-CNN

arXiv: Computer Vision and Pattern Recognition, 2019.

Keywords:
connected head, object detection, fc head, loss for fc-head, feature map

Abstract:

Modern R-CNN based detectors share the RoI feature extractor head for both classification and localization tasks, based upon the correlation between the two tasks. In contrast, we found that different head structures (i.e. the fully connected head and the convolution head) have opposite preferences towards these two tasks. Specifically, the fully connected head is more suitable for the classification task, while the convolution head is more suitable for the localization task.

Introduction
  • Most two-stage object detectors [10, 11, 35, 4, 26] share a head for both classification and bounding box regression.
  • Two different head structures are widely used.
  • The authors perform a thorough comparison between the fully connected head and the convolution head on the two detection tasks, i.e. object classification and localization.
  • The authors find that these two different head structures are complementary: fc-head is more suitable for the classification task, while conv-head is more suitable for the localization task.
Highlights
  • Most two-stage object detectors [10, 11, 35, 4, 26] share a head for both classification and bounding box regression
  • In contrast to existing methods, which apply a single head to extract Region of Interests (RoI) features for both classification and bounding box regression tasks, we propose to split these two tasks into different heads, based upon our thorough analysis
  • Fc-head is more suitable for the classification task, while conv-head is more suitable for the localization task
  • We examine the output feature maps of both heads and find that fc-head has more spatial sensitivity than conv-head
  • Fc-head is better at distinguishing a complete object from part of an object, but is less robust when regressing the whole object
  • We propose a Double-Head method, which has a fully connected head focusing on classification and a convolution head for bounding box regression (a minimal sketch follows this list)
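
To make the head split concrete, below is a minimal PyTorch sketch of a double-head RoI predictor. It is illustrative only, not the authors' implementation: the layer widths, the block count, and the use of plain conv blocks (the paper stacks residual and non-local blocks) are assumptions.

```python
import torch.nn as nn

class DoubleHead(nn.Module):
    """Sketch of a double-head RoI predictor: fc-head for classification,
    conv-head for bounding box regression. All sizes are illustrative."""

    def __init__(self, in_channels=256, roi_size=7, num_classes=81, num_blocks=3):
        super().__init__()
        # fc-head: two fully connected layers over the flattened RoI feature.
        self.fc_head = nn.Sequential(
            nn.Flatten(),
            nn.Linear(in_channels * roi_size * roi_size, 1024), nn.ReLU(inplace=True),
            nn.Linear(1024, 1024), nn.ReLU(inplace=True),
        )
        self.cls_score = nn.Linear(1024, num_classes)

        # conv-head: a stack of 3x3 conv blocks (residual/non-local blocks
        # in the paper), followed by global average pooling.
        convs = []
        for _ in range(num_blocks):
            convs += [nn.Conv2d(in_channels, in_channels, 3, padding=1),
                      nn.BatchNorm2d(in_channels), nn.ReLU(inplace=True)]
        self.conv_head = nn.Sequential(*convs, nn.AdaptiveAvgPool2d(1), nn.Flatten())
        self.bbox_pred = nn.Linear(in_channels, num_classes * 4)

    def forward(self, roi_feats):  # roi_feats: (N, C, roi_size, roi_size)
        scores = self.cls_score(self.fc_head(roi_feats))    # classification from fc-head
        deltas = self.bbox_pred(self.conv_head(roi_feats))  # regression from conv-head
        return scores, deltas
```
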
Methods
  • Fusion of classifiers: the fusion-method ablation compares No fusion, Max, and Average on AP, AP0.5, and AP0.75 (see Table 3). For each model, the authors evaluate AP using conv-head alone (Figure 8-(a)), fc-head alone (Figure 8-(b)), classification from fc-head with bounding boxes from conv-head (Figure 8-(c)), and classification fused from both heads with bounding boxes from conv-head (Figure 8-(d)). ωfc and ωconv are set to 2.0 and 2.5 in all experiments.
  • The unfocused tasks are helpful: the best Double-Head-Ext model (40.3 AP) corresponds to λfc = 0.7, λconv = 0.8 (blue box in Figure 8-(d)).
  • It outperforms Double-Head without unfocused tasks (39.8 AP, green box in Figure 8-(c)) by 0.5 AP; a sketch of the loss weighting and score fusion follows this list.
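
Below is a hedged sketch of how the loss weights and the complementary fusion fit together. The loss terms are paraphrased rather than copied from the paper: each head's focused task is assumed to be weighted by λ and its unfocused task by (1 − λ), with the head totals scaled by ωfc and ωconv; the fusion follows Eq. 4 as referenced in Table 3.

```python
def double_head_ext_loss(cls_fc, reg_fc, cls_conv, reg_conv,
                         lam_fc=0.7, lam_conv=0.8, w_fc=2.0, w_conv=2.5):
    # Assumed weighting: the focused task gets lambda, the unfocused task
    # gets (1 - lambda), and each head's total is scaled by its omega.
    loss_fc = lam_fc * cls_fc + (1.0 - lam_fc) * reg_fc            # fc-head focuses on classification
    loss_conv = lam_conv * reg_conv + (1.0 - lam_conv) * cls_conv  # conv-head focuses on regression
    return w_fc * loss_fc + w_conv * loss_conv

def complementary_fusion(s_fc, s_conv):
    # Complementary fusion of classification scores (Eq. 4): the conv-head
    # score fills in the headroom (1 - s_fc) left by the fc-head score.
    return s_fc + s_conv * (1.0 - s_fc)
```

With λfc = 0.7 and λconv = 0.8, each head still receives gradient from its unfocused task, consistent with the 0.5 AP gain reported above.
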
Results
  • The authors evaluate the approach on the MS COCO 2017 dataset [28] and the Pascal VOC07 dataset [8]; MS COCO 2017 has 80 object categories.
  • Comparison with baselines on COCO: Table 5 compares the method with the Faster R-CNN [35] and FPN [26] baselines on COCO val2017.
  • The authors' method outperforms both baselines on all evaluation metrics.
  • The authors' method gains 3.5+ AP at the higher IoU threshold (0.75) and 1.4+ AP at the lower IoU threshold (0.5) for both backbones.
  • This demonstrates the advantage of the double-head design.
Conclusion
  • Why does fc-head show more correlation between classification scores and proposal IoUs, yet perform worse in localization? The authors believe it is because fc-head is more spatially sensitive than conv-head.
  • For conv-head, whose output feature map is a 7 × 7 grid, the authors compute the spatial correlation between every pair of locations using the cosine distance between the corresponding feature vectors (a minimal sketch of this computation follows this list).
  • Fc-head is better at distinguishing a complete object from part of an object, but is less robust when regressing the whole object.
  • Based upon these findings, the authors propose a Double-Head method, which has a fully connected head focusing on classification and a convolution head for bounding box regression.
  • The authors hope that these findings are helpful for future research in object detection.
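
For reference, a minimal sketch of the spatial-correlation measurement described above: pairwise cosine similarity between the cells of an output feature map (e.g. the conv-head's 7 × 7 grid). The function name and tensor layout are assumptions.

```python
import torch
import torch.nn.functional as F

def spatial_correlation(feat: torch.Tensor) -> torch.Tensor:
    """Pairwise cosine similarity between grid cells of a feature map.
    feat: (C, H, W) -> correlation matrix of shape (H*W, H*W)."""
    c, h, w = feat.shape
    vecs = feat.reshape(c, h * w).t()   # one C-dim vector per grid cell
    vecs = F.normalize(vecs, dim=1)     # unit norm, so dot product = cosine
    return vecs @ vecs.t()              # (H*W, H*W)
```
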
Tables
  • Table1: Evaluations of detectors with different head structures on COCO val2017. The backbone is FPN with ResNet-50. The top group shows performance for single-head detectors. The middle group shows performance for detectors with double heads. The weight for each loss (classification and bounding box regression) is set to 1.0. Compared to the middle group, the bottom group uses different loss weights for fc-head and conv-head (ωfc = 2.0, ωconv = 2.5). Clearly, Double-Head has the best performance, outperforming the others by a non-negligible margin; Double-Head-Reverse has the worst performance
  • Table2: The number of blocks (Figure 5) in the convolution head. The baseline (K = 0) is equivalent to the original FPN [26], which uses fc-head alone. The first group only stacks residual blocks, while the second group alternates (K + 1)/2 residual blocks and (K − 1)/2 non-local blocks
  • Table3: Fusion of classifiers from both heads. Complementary fusion (Eq 4) outperforms others. The model is trained using weights λfc = 0.7, λconv = 0.8
  • Table4: Comparison with the FPN baseline [26] on the VOC07 dataset with ResNet-50 backbone. Our Double-Head-Ext outperforms the FPN baseline
  • Table5: Object detection results (bounding box AP) on COCO val2017. Note that FPN baseline only has fc-head. Our Double-Head and Double-Head-Ext outperform both Faster R-CNN and FPN baselines on two backbones (ResNet-50 and ResNet-101)
  • Table6: Object detection results (bounding box AP), vs. state-of-the-art on COCO test-dev. All methods are in the family of two-stage detectors with a single training stage. Our Double-Head-Ext achieves the best performance
Related work
  • One-stage Object Detectors: OverFeat [37] detects objects by sliding windows on feature maps. SSD [29, 9] and YOLO [32, 33, 34] have been tuned for speed by predicting object classes and locations directly. RetinaNet [27] alleviates the extreme foreground-background class imbalance problem by introducing focal loss. Point-based methods [21, 22, 47, 7, 48] model an object as keypoints (corner, center, etc.), and are built on keypoint estimation networks.
  • Two-stage Object Detectors: R-CNN [12] applies a deep neural network to extract features from proposals generated by selective search [42]. SPPNet [14] speeds up R-CNN significantly using spatial pyramid pooling. Fast R-CNN [10] improves speed and performance by utilizing a differentiable RoI pooling. Faster R-CNN [35] introduces the Region Proposal Network (RPN) to generate proposals. R-FCN [4] employs position-sensitive RoI pooling to address the translation-variance problem. FPN [26] builds a top-down architecture with lateral connections to extract features across multiple layers.
  • Backbone Networks: Fast R-CNN [10] and Faster R-CNN [35] extract features from conv4 of VGG-16 [38], while FPN [26] utilizes features from multiple layers (conv2 to conv5) of ResNet [15]. Deformable ConvNets [5, 49] propose deformable convolution and deformable Region of Interest (RoI) pooling to augment spatial sampling locations. Trident Network [24] generates scale-aware feature maps with a multi-branch architecture. MobileNet [17, 36] and ShuffleNet [46, 30] introduce efficient operators (such as depthwise convolution, group convolution, and channel shuffle) to speed up inference on mobile devices.
  • Detection Heads: Light-Head R-CNN [25] introduces an efficient head network with thin feature maps. Cascade R-CNN [3] constructs a sequence of detection heads trained with increasing IoU thresholds. Feature-sharing Cascade R-CNN [23] ensembles multi-stage outputs from Cascade R-CNN [3] via feature sharing to improve results. Mask R-CNN [13] introduces an extra head for instance segmentation. The COCO Detection 2018 Challenge winner (Megvii) [1] couples bounding box regression and instance segmentation in a convolution head. IoU-Net [20] introduces a branch to predict IoUs between detected bounding boxes and their corresponding ground-truth boxes. Similar to IoU-Net, Mask Scoring R-CNN [18] presents an extra head to predict a Mask IoU score for each segmentation mask. He et al. [16] learn uncertainties of bounding box prediction with an extra task to improve localization. Learning-to-Rank [39] utilizes an extra head to produce a rank value of a proposal for Non-Maximum Suppression (NMS). Zhang and Wang [45] point out that misalignments exist between the classification and localization task domains. In contrast to existing methods, which apply a single head to extract RoI features for both classification and bounding box regression tasks, we propose to split these two tasks into different heads, based upon our thorough analysis.
Reference
  • [1] MSCOCO instance segmentation challenges 2018, Megvii (Face++) team. COCO18-Detect-Megvii.pdf, 2018.
  • [2] Navaneeth Bodla, Bharat Singh, Rama Chellappa, and Larry S. Davis. Soft-NMS: Improving object detection with one line of code. In Proceedings of the IEEE International Conference on Computer Vision, pages 5561–5569, 2017.
  • [3] Zhaowei Cai and Nuno Vasconcelos. Cascade R-CNN: Delving into high quality object detection. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2018.
  • [4] Jifeng Dai, Yi Li, Kaiming He, and Jian Sun. R-FCN: Object detection via region-based fully convolutional networks. In Advances in Neural Information Processing Systems 29, pages 379–387, 2016.
  • [5] Jifeng Dai, Haozhi Qi, Yuwen Xiong, Yi Li, Guodong Zhang, Han Hu, and Yichen Wei. Deformable convolutional networks. In Proceedings of the IEEE International Conference on Computer Vision, pages 764–773, 2017.
  • [6] Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. ImageNet: A large-scale hierarchical image database. In 2009 IEEE Conference on Computer Vision and Pattern Recognition, pages 248–255, 2009.
  • [7] Kaiwen Duan, Song Bai, Lingxi Xie, Honggang Qi, Qingming Huang, and Qi Tian. CenterNet: Keypoint triplets for object detection. In The IEEE International Conference on Computer Vision (ICCV), October 2019.
  • [8] Mark Everingham, Luc Van Gool, Christopher K. I. Williams, John Winn, and Andrew Zisserman. The PASCAL Visual Object Classes (VOC) challenge. International Journal of Computer Vision, 88(2):303–338, 2010.
  • [9] Cheng-Yang Fu, Wei Liu, Ananth Ranga, Ambrish Tyagi, and Alexander C. Berg. DSSD: Deconvolutional single shot detector. arXiv preprint arXiv:1701.06659, 2017.
  • [10] Ross Girshick. Fast R-CNN. In Proceedings of the International Conference on Computer Vision (ICCV), 2015.
  • [11] Ross Girshick, Jeff Donahue, Trevor Darrell, and Jitendra Malik. Rich feature hierarchies for accurate object detection and semantic segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2014.
  • [12] Ross Girshick, Jeff Donahue, Trevor Darrell, and Jitendra Malik. Rich feature hierarchies for accurate object detection and semantic segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 580–587, 2014.
  • [13] Kaiming He, Georgia Gkioxari, Piotr Dollár, and Ross Girshick. Mask R-CNN. In 2017 IEEE International Conference on Computer Vision (ICCV), pages 2980–2988, 2017.
  • [14] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Spatial pyramid pooling in deep convolutional networks for visual recognition. In European Conference on Computer Vision, 2014.
  • [15] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 770–778, 2016.
  • [16] Yihui He, Chenchen Zhu, Jianren Wang, Marios Savvides, and Xiangyu Zhang. Bounding box regression with uncertainty for accurate object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2888–2897, 2019.
  • [17] Andrew G. Howard, Menglong Zhu, Bo Chen, Dmitry Kalenichenko, Weijun Wang, Tobias Weyand, Marco Andreetto, and Hartwig Adam. MobileNets: Efficient convolutional neural networks for mobile vision applications. arXiv preprint arXiv:1704.04861, 2017.
  • [18] Zhaojin Huang, Lichao Huang, Yongchao Gong, Chang Huang, and Xinggang Wang. Mask Scoring R-CNN. In CVPR, 2019.
  • [19] Sergey Ioffe and Christian Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. In International Conference on Machine Learning, pages 448–456, 2015.
  • [20] Borui Jiang, Ruixuan Luo, Jiayuan Mao, Tete Xiao, and Yuning Jiang. Acquisition of localization confidence for accurate object detection. In Proceedings of the European Conference on Computer Vision (ECCV), pages 784–799, 2018.
  • [21] Hei Law and Jia Deng. CornerNet: Detecting objects as paired keypoints. In ECCV, 2018.
  • [22] Hei Law, Yun Teng, Olga Russakovsky, and Jia Deng. CornerNet-Lite: Efficient keypoint based object detection. arXiv preprint arXiv:1904.08900, 2019.
  • [23] Ang Li, Xue Yang, and Chongyang Zhang. Rethinking classification and localization for cascade R-CNN. In Proceedings of the British Machine Vision Conference (BMVC), 2019.
  • [24] Yanghao Li, Yuntao Chen, Naiyan Wang, and Zhaoxiang Zhang. Scale-aware trident networks for object detection. In The IEEE International Conference on Computer Vision (ICCV), October 2019.
  • [25] Zeming Li, Chao Peng, Gang Yu, Xiangyu Zhang, Yangdong Deng, and Jian Sun. Light-Head R-CNN: In defense of two-stage object detector. arXiv preprint arXiv:1711.07264, 2017.
  • [26] T. Lin, P. Dollár, R. Girshick, K. He, B. Hariharan, and S. Belongie. Feature pyramid networks for object detection. In 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 936–944, July 2017.
  • [27] Tsung-Yi Lin, Priya Goyal, Ross Girshick, Kaiming He, and Piotr Dollár. Focal loss for dense object detection. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2018.
  • [28] Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C. Lawrence Zitnick. Microsoft COCO: Common objects in context. In European Conference on Computer Vision, pages 740–755, 2014.
  • [29] Wei Liu, Dragomir Anguelov, Dumitru Erhan, Christian Szegedy, Scott Reed, Cheng-Yang Fu, and Alexander C. Berg. SSD: Single shot multibox detector. In European Conference on Computer Vision, pages 21–37, 2016.
  • [30] Ningning Ma, Xiangyu Zhang, Hai-Tao Zheng, and Jian Sun. ShuffleNet V2: Practical guidelines for efficient CNN architecture design. In Proceedings of the European Conference on Computer Vision (ECCV), pages 116–131, 2018.
  • [31] Francisco Massa and Ross Girshick. maskrcnn-benchmark: Fast, modular reference implementation of Instance Segmentation and Object Detection algorithms in PyTorch. https://github.com/facebookresearch/maskrcnn-benchmark, 2018.
  • [32] Joseph Redmon, Santosh Divvala, Ross Girshick, and Ali Farhadi. You only look once: Unified, real-time object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 779–788, 2016.
  • [33] Joseph Redmon and Ali Farhadi. YOLO9000: Better, faster, stronger. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 7263–7271, 2017.
  • [34] Joseph Redmon and Ali Farhadi. YOLOv3: An incremental improvement. arXiv preprint arXiv:1804.02767, 2018.
  • [35] Shaoqing Ren, Kaiming He, Ross Girshick, and Jian Sun. Faster R-CNN: Towards real-time object detection with region proposal networks. In Advances in Neural Information Processing Systems, pages 91–99, 2015.
  • [36] Mark Sandler, Andrew Howard, Menglong Zhu, Andrey Zhmoginov, and Liang-Chieh Chen. MobileNetV2: Inverted residuals and linear bottlenecks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 4510–4520, 2018.
  • [37] Pierre Sermanet, David Eigen, Xiang Zhang, Michael Mathieu, Robert Fergus, and Yann LeCun. OverFeat: Integrated recognition, localization and detection using convolutional networks. In International Conference on Learning Representations (ICLR), 2014.
  • [38] K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition. In International Conference on Learning Representations, 2015.
  • [39] Zhiyu Tan, Xuecheng Nie, Qi Qian, Nan Li, and Hao Li. Learning to rank proposals for object detection. In The IEEE International Conference on Computer Vision (ICCV), October 2019.
  • [40] Lachlan Tychsen-Smith and Lars Petersson. DeNet: Scalable real-time object detection with directed sparse sampling. In Proceedings of the IEEE International Conference on Computer Vision, pages 428–436, 2017.
  • [41] Lachlan Tychsen-Smith and Lars Petersson. Improving object localization with fitness NMS and bounded IoU loss. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 6877–6885, 2018.
  • [42] Jasper R. R. Uijlings, Koen E. A. van de Sande, Theo Gevers, and Arnold W. M. Smeulders. Selective search for object recognition. International Journal of Computer Vision, 104(2):154–171, 2013.
  • [43] Xiaolong Wang, Ross Girshick, Abhinav Gupta, and Kaiming He. Non-local neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 7794–7803, 2018.
  • [44] Hongyu Xu, Xutao Lv, Xiaoyu Wang, Zhou Ren, Navaneeth Bodla, and Rama Chellappa. Deep regionlets for object detection. In The European Conference on Computer Vision (ECCV), September 2018.
  • [45] Haichao Zhang and Jianyu Wang. Towards adversarially robust object detection. In The IEEE International Conference on Computer Vision (ICCV), October 2019.
  • [46] Xiangyu Zhang, Xinyu Zhou, Mengxiao Lin, and Jian Sun. ShuffleNet: An extremely efficient convolutional neural network for mobile devices. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 6848–6856, 2018.
  • [47] Xingyi Zhou, Dequan Wang, and Philipp Krähenbühl. Objects as points. arXiv preprint arXiv:1904.07850, 2019.
  • [48] Xingyi Zhou, Jiacheng Zhuo, and Philipp Krähenbühl. Bottom-up object detection by grouping extreme and center points. In CVPR, 2019.
  • [49] Xizhou Zhu, Han Hu, Stephen Lin, and Jifeng Dai. Deformable ConvNets v2: More deformable, better results. In CVPR, 2019.