Object Detection Networks on Convolutional Feature Maps

IEEE Transactions on Pattern Analysis and Machine Intelligence, pp. 1476-1481, 2017.

Abstract:

Most object detectors contain two important components: a feature extractor and an object classifier. The feature extractor has rapidly evolved with significant research efforts leading to better deep convolutional architectures. The object classifier, however, has not received much attention and many recent systems (like SPPnet and Fast/...

Introduction
  • Most object detectors contain two important components: a feature extractor and an object classifier.
  • The feature extractor in traditional object detection methods is a hand-engineered module, such as HOG [1].
  • An R-CNN can be thought of as a convolutional feature extractor, ending at the last pooling layer, followed by a multi-layer perceptron (MLP) classifier.
  • This methodology appears rather different from traditional methods
Highlights
  • Most object detectors contain two important components: a feature extractor and an object classifier
  • We report that superior image classification backbones (e.g., ResNets and GoogLeNets) do not directly lead to better object detection accuracy, and that a deep, convolutional region-wise classifier (a “Network on Convolutional feature maps”, NoC) is an essential element for outstanding detection performance, in addition to Faster R-CNN and extremely deep ResNets
  • We investigate two sets of training images: (i) the original trainval set of 5k images in VOC 2007, and (ii) an augmented set of 16k images that consists of VOC 2007 trainval images and VOC 2012 trainval images, following [27]
  • A convolutional region-wise classifier is more effective than a multi-layer perceptron-based region-wise classifier. These observations are strongly supported by the experimental results on the more challenging MS COCO dataset (Table 8)
  • We discover that deep convolutional classifiers are just as important as deep convolutional feature extractors
  • Fig. 3 shows that VGG-16 in general has lower recognition error than the Zeiler and Fergus net, when using the same classifiers (e.g., 1.6%+1.3%+7.4% vs. 3.2%+2.2%+7.4%)
  • Based on the observations from the “Networks on Convolutional feature maps” (NoC) perspective, we present a way of using Faster R-CNN with ResNets, which achieves nontrivial results on challenging datasets including MS COCO
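The Highlights above hinge on where the region-wise classifier sits: everything up to the last conv layer is a shared feature extractor, and a small per-region network (the “NoC”) classifies each RoI-pooled feature. A minimal NumPy sketch of the baseline MLP-style (“3fc”) head, with random weights and sizes shrunk purely to illustrate the shapes (not the authors’ trained model):

```python
import numpy as np

rng = np.random.default_rng(0)

# RoI-pooled feature for one region proposal: 256 channels, 7x7 spatial
# (ZF-net-like sizes; the weights below are random, only to show shapes).
roi_feat = rng.standard_normal((256, 7, 7))

def fc_relu(x, w, b):
    """A fully connected layer followed by ReLU."""
    return np.maximum(w @ x + b, 0.0)

# A "3fc" MLP-style region classifier (as in SPPnet / Fast R-CNN):
# flatten the RoI feature, then fc -> fc -> class scores.  Hidden width
# is 4096 in the actual networks; shrunk to 512 here to keep it light.
x = roi_feat.reshape(-1)                                # 256*7*7 = 12544-d
w1, b1 = 0.01 * rng.standard_normal((512, x.size)), np.zeros(512)
w2, b2 = 0.01 * rng.standard_normal((512, 512)), np.zeros(512)
h = fc_relu(fc_relu(x, w1, b1), w2, b2)
w3, b3 = 0.01 * rng.standard_normal((21, 512)), np.zeros(21)  # 20 VOC classes + background
scores = w3 @ h + b3
print(scores.shape)  # (21,)
```

A convolutional NoC replaces the first one or two fc layers of this head with conv layers operating on the 7×7 map before flattening.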
Results
  • Table 8 shows the results on MS COCO val.
  • The authors discuss the results by dividing them into three cases, as follows.
  • Naïve Faster R-CNN: by this the authors mean that the RoI pooling layer is naïvely adopted after the last convolutional layer.
  • The authors set the output resolution of RoI pooling to 7×7.
  • This is followed by an 81-d classifier.
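The naïve pipeline above can be made concrete. The sketch below is a simplified RoI max-pooling routine (not the authors’ implementation): each region of the last conv feature map is divided into a 7×7 grid and max-pooled per bin, yielding a fixed-size feature that an 81-d classifier (80 COCO classes plus background) can consume:

```python
import numpy as np

def roi_max_pool(feat, roi, out=7):
    """Naive RoI max pooling: split the RoI into an out x out grid and
    take the per-channel max of each bin.  feat is (C, H, W); roi is
    (x0, y0, x1, y1) in feature-map coordinates."""
    x0, y0, x1, y1 = roi
    c = feat.shape[0]
    pooled = np.empty((c, out, out))
    xs = np.linspace(x0, x1, out + 1).astype(int)
    ys = np.linspace(y0, y1, out + 1).astype(int)
    for i in range(out):
        for j in range(out):
            # Guard against empty bins when the RoI is small.
            bin_ = feat[:, ys[i]:max(ys[i + 1], ys[i] + 1),
                           xs[j]:max(xs[j + 1], xs[j] + 1)]
            pooled[:, i, j] = bin_.max(axis=(1, 2))
    return pooled

feat = np.arange(2 * 20 * 20, dtype=float).reshape(2, 20, 20)
pooled = roi_max_pool(feat, (3, 4, 18, 19))
print(pooled.shape)  # (2, 7, 7)
```

In the naïve setup, this fixed-size output is flattened and fed directly to the 81-way classification layer, with no deeper per-region network in between.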
Conclusion
  • The authors delve into the detection systems and provide insights about the region-wise classifiers.
  • Based on the observations from the NoC perspective, the authors present a way of using Faster R-CNN with ResNets, which achieves nontrivial results on challenging datasets including MS COCO
Tables
  • Table1: Detection mAP (%) of NoC as MLP for PASCAL VOC 07 using a ZF net. The training set is
  • Table2: Detection mAP (%) of NoC as ConvNet for PASCAL VOC 07 using a ZF net. The training sets are PASCAL VOC 07 trainval and 07+12 trainval respectively. The NoCs are randomly initialized. No bbox regression is used
  • Table3: Detection mAP (%) of maxout NoC for PASCAL VOC 07 using a ZF net. The training set is 07+12 trainval. The NoCs are randomly initialized. No bbox regression is used
  • Table4: Detection mAP (%) of NoC for PASCAL VOC 07 using ZF/VGG-16 nets with different initialization. The training sets are PASCAL VOC 07 trainval and PASCAL VOC 07+12 trainval respectively. No bounding box regression is used
  • Table5: Detection results for PASCAL VOC 07 using VGG nets. The training set is PASCAL VOC 07+12 trainval. The NoC is the fine-tuned version (Sec. 3.4). No bounding box regression is used
  • Table6: Detection results for the PASCAL VOC 2007 test set using the VGG-16 model [15]. Here “bb” denotes post-hoc bounding box regression [6]
  • Table7: Detection results for the PASCAL VOC 2012 test set using the VGG-16 model [15]. Here “bb” denotes post-hoc bounding box regression [6]
  • Table8: Detection results of Faster R-CNN on the MS COCO val set. “inc” indicates an inception block, and “res” indicates a residual block
  • Table9: Detection results of Faster R-CNN + ResNet101 on MS COCO val (trained on MS COCO train) and PASCAL VOC 2007 test (trained on 07+12), based on different NoC structures
Related work
  • Traditional Object Detection. Research on object detection in general focuses on both features and classifiers. The pioneering work of Viola and Jones [19] uses simple Haar-like features and boosted classifiers on sliding windows. The pedestrian detection method in [1] proposes HOG features used with linear SVMs. The DPM method [2] develops deformable graphical models and latent SVM as a sliding-window classifier. The Selective Search paper [4] relies on spatial pyramid features [20] on dense SIFT vectors [21] and an additive kernel SVM. The Regionlet method [3] learns boosted classifiers on HOG and other features.
Funding
  • Compared with the SVM classifier trained on the RoI features (“SVM on RoI”, equivalent to a 1-fc structure), the 4-fc NoC as a classifier on the same features has 7.8% higher mAP
  • Table 3 shows the mAP of the four variants of maxout NoCs. Their mAP is higher than that of the non-maxout counterpart, by up to 1.8% mAP
  • Fig. 3 shows that VGG-16 in general has lower recognition error than the ZF net, when using the same classifiers (e.g., 1.6%+1.3%+7.4% vs. 3.2%+2.2%+7.4%)
  • On the other hand, when using a stronger NoC (maxout 2conv3fc), the localization error is substantially reduced compared with the 3fc baseline (22.6% vs. 28.1% with ZF, and 20.1% vs. 24.8% with VGG-16)
  • Our method achieves 71.6% mAP on the PASCAL VOC 2007 test set
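The maxout NoC referenced above fuses two parallel activations by an element-wise max, following maxout [16]; in the paper, the two inputs come from RoI features at two scales. The fusion operation itself is simply:

```python
import numpy as np

# Maxout merges two parallel activations by element-wise max.  The
# vectors here are made-up toy activations, standing in for the outputs
# of two sibling conv layers on two-scale RoI features.
a = np.array([0.2, -1.0, 3.5, 0.0])
b = np.array([1.1, -0.5, 2.0, -2.0])
fused = np.maximum(a, b)
print(fused)  # [ 1.1 -0.5  3.5  0. ]
```

Each output element keeps whichever input responds more strongly, which is how the maxout NoC selects between scales per activation.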

Reference
  • [1] N. Dalal and B. Triggs, “Histograms of oriented gradients for human detection,” in CVPR, 2005.
  • [2] P. F. Felzenszwalb, R. B. Girshick, D. McAllester, and D. Ramanan, “Object detection with discriminatively trained part-based models,” TPAMI, 2010.
  • [3] X. Wang, M. Yang, S. Zhu, and Y. Lin, “Regionlets for generic object detection,” in ICCV, 2013.
  • [4] J. R. Uijlings, K. E. van de Sande, T. Gevers, and A. W. Smeulders, “Selective search for object recognition,” IJCV, 2013.
  • [5] A. Krizhevsky, I. Sutskever, and G. Hinton, “ImageNet classification with deep convolutional neural networks,” in NIPS, 2012.
  • [6] R. Girshick, J. Donahue, T. Darrell, and J. Malik, “Rich feature hierarchies for accurate object detection and semantic segmentation,” in CVPR, 2014.
  • [7] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei, “ImageNet: A large-scale hierarchical image database,” in CVPR, 2009.
  • [8] P.-A. Savalle, S. Tsogkas, G. Papandreou, and I. Kokkinos, “Deformable part models with CNN features,” in Parts and Attributes Workshop, ECCV, 2014.
  • [9] R. Girshick, F. Iandola, T. Darrell, and J. Malik, “Deformable part models are convolutional neural networks,” in CVPR, 2015.
  • [10] L. Wan, D. Eigen, and R. Fergus, “End-to-end integration of a convolutional network, deformable parts model and non-maximum suppression,” in CVPR, 2015.
  • [11] W. Y. Zou, X. Wang, M. Sun, and Y. Lin, “Generic object detection with dense neural patterns and regionlets,” in BMVC, 2014.
  • [12] K. He, X. Zhang, S. Ren, and J. Sun, “Spatial pyramid pooling in deep convolutional networks for visual recognition,” in ECCV, 2014.
  • [13] R. Girshick, “Fast R-CNN,” in ICCV, 2015.
  • [14] S. Ren, K. He, R. Girshick, and J. Sun, “Faster R-CNN: Towards real-time object detection with region proposal networks,” in NIPS, 2015.
  • [15] K. Simonyan and A. Zisserman, “Very deep convolutional networks for large-scale image recognition,” in ICLR, 2015.
  • [16] I. J. Goodfellow, D. Warde-Farley, M. Mirza, A. Courville, and Y. Bengio, “Maxout networks,” arXiv:1302.4389, 2013.
  • [17] K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” arXiv:1512.03385, 2015.
  • [18] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, and A. Rabinovich, “Going deeper with convolutions,” arXiv:1409.4842, 2014.
  • [19] P. Viola and M. Jones, “Rapid object detection using a boosted cascade of simple features,” in CVPR, 2001.
  • [20] S. Lazebnik, C. Schmid, and J. Ponce, “Beyond bags of features: Spatial pyramid matching for recognizing natural scene categories,” in CVPR, 2006.
  • [21] D. G. Lowe, “Distinctive image features from scale-invariant keypoints,” IJCV, 2004.
  • [22] P. Sermanet, D. Eigen, X. Zhang, M. Mathieu, R. Fergus, and Y. LeCun, “OverFeat: Integrated recognition, localization and detection using convolutional networks,” in ICLR, 2014.
  • [23] K. Lenc and A. Vedaldi, “R-CNN minus R,” in BMVC, 2015.
  • [24] S. Gidaris and N. Komodakis, “Object detection via a multi-region & semantic segmentation-aware CNN model,” in ICCV, 2015.
  • [25] M. D. Zeiler and R. Fergus, “Visualizing and understanding convolutional networks,” in ECCV, 2014.
  • [26] M. Everingham, L. Van Gool, C. K. Williams, J. Winn, and A. Zisserman, “The PASCAL Visual Object Classes (VOC) Challenge,” IJCV, 2010.
  • [27] P. Agrawal, R. Girshick, and J. Malik, “Analyzing the performance of multilayer neural networks for object recognition,” in ECCV, 2014.
  • [28] K. Hornik, M. Stinchcombe, and H. White, “Multilayer feedforward networks are universal approximators,” Neural Networks, 1989.
  • [29] V. Nair and G. E. Hinton, “Rectified linear units improve restricted Boltzmann machines,” in ICML, 2010.
  • [30] D. Hoiem, Y. Chodpathumwan, and Q. Dai, “Diagnosing error in object detectors,” in ECCV, 2012.
  • [31] T.-Y. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár, and C. L. Zitnick, “Microsoft COCO: Common objects in context,” arXiv:1405.0312, 2014.
  • [32] S. Ioffe and C. Szegedy, “Batch normalization: Accelerating deep network training by reducing internal covariate shift,” in ICML, 2015.
  • [33] J. Long, E. Shelhamer, and T. Darrell, “Fully convolutional networks for semantic segmentation,” in CVPR, 2015.
  • [34] L.-C. Chen, G. Papandreou, I. Kokkinos, K. Murphy, and A. L. Yuille, “Semantic image segmentation with deep convolutional nets and fully connected CRFs,” in ICLR, 2015.
  • [35] S. Mallat, A Wavelet Tour of Signal Processing. Academic Press, 1999.