Spatial Pyramid Pooling in Deep Convolutional Networks for Visual Recognition

European Conference on Computer Vision (ECCV), pp. 346-361, 2014.

Abstract:

Existing deep convolutional neural networks (CNNs) require a fixed-size (e.g. 224×224) input image. This requirement is “artificial” and may hurt the recognition accuracy for the images or sub-images of an arbitrary size/scale. In this work, we equip the networks with a more principled pooling strategy, “spatial pyramid pooling”, to eliminate the above requirement.

Introduction
  • Existing deep convolutional neural networks (CNNs) require a fixed-size (e.g., 224×224) input image.
  • Using SPP-net, the authors compute the feature maps from the entire image only once, and pool features in arbitrary regions to generate fixed-length representations for training the detectors.
  • Convolutional layers do not require a fixed image size and can generate feature maps of any size.
  • The authors introduce a spatial pyramid pooling (SPP) [14], [15] layer to remove the fixed-size constraint of the network.
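The items above describe the core mechanism: after the last convolutional layer, a feature map of arbitrary size is pooled into a fixed number of spatial bins, so the fully-connected layers always receive a fixed-length vector. Below is a minimal NumPy sketch of such a pyramid pooling step; the helper name `spp_pool` and the floor/ceil bin boundaries are illustrative assumptions in the style of adaptive pooling, not necessarily the paper's exact windowing rule.

```python
import numpy as np

def spp_pool(feature_map, levels=(1, 2, 3, 6)):
    """Pool a conv feature map of shape (C, H, W) into one fixed-length vector.

    Minimal sketch of spatial pyramid pooling: for each pyramid level n the
    H x W map is split into n x n bins and max-pooled per bin, so the output
    length, C * sum(n * n for n in levels), does not depend on H or W.
    The floor/ceil bin boundaries below are an illustrative choice.
    """
    C, H, W = feature_map.shape
    parts = []
    for n in levels:
        for i in range(n):
            h0, h1 = (i * H) // n, ((i + 1) * H + n - 1) // n        # floor / ceil
            for j in range(n):
                w0, w1 = (j * W) // n, ((j + 1) * W + n - 1) // n    # floor / ceil
                parts.append(feature_map[:, h0:h1, w0:w1].max(axis=(1, 2)))
    return np.concatenate(parts)

# Two maps of different spatial sizes yield vectors of identical length.
v1 = spp_pool(np.random.rand(256, 13, 13))
v2 = spp_pool(np.random.rand(256, 10, 17))
assert v1.shape == v2.shape == (256 * (1 + 4 + 9 + 36),)  # 50 bins x 256 channels
```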
Highlights
  • Existing deep convolutional neural networks (CNNs) require a fixed-size (e.g., 224×224) input image
  • We introduce a spatial pyramid pooling (SPP) [14], [15] layer to remove the fixed-size constraint of the network
  • VOC 2007 and Caltech101 results: SPP-net is better than the no-SPP net (Table 7 (b) vs. (a)), and the full-view representation is better than the crop ((c) vs. (b))
  • An explanation is that our fc layers are pretrained using image regions, while in the detection case they are used on the feature map regions
  • Our 5-scale result (59.2%) is 0.7% better than R-CNN (58.5%), and our 1-scale result (58.0%) is 0.5% worse
  • We have suggested a solution to train a deep network with a spatial pyramid pooling layer
Results
  • The authors show that they can run the convolutional layers only once on the entire image, and extract features with SPP-net on the feature maps.
  • Pipeline: input image → convolutional layers → conv5 feature maps (256 channels) → spatial pyramid pooling layer → fixed-length vectors.
  • The authors first consider a network taking a fixed-size input (224×224) cropped from images.
  • The feature maps after each layer have the same size as ZF-5.
  • The authors' method is the first one that trains a single network with input images of multiple sizes.
  • In the Overfeat paper [5], the views are extracted from the convolutional feature maps instead of image crops.
  • Detection pipeline: window → feature maps of conv5 → spatial pyramid pooling layer → fixed-length representation.
  • A pre-trained deep network is used to extract the feature of each window.
  • Because R-CNN repeatedly applies the deep convolutional network to about 2,000 windows per image, it is time-consuming.
  • The Overfeat detection method [5] extracts from windows of deep convolutional feature maps, but needs to predefine the window size.
  • The authors use a 4-level spatial pyramid (1×1, 2×2, 3×3, 6×6, 50 bins in total) to pool the features.
  • The authors' method only requires computing the feature maps once from the entire image, regardless of the number of candidate windows.
  • Since the features are pooled from the conv5 feature maps from windows of any size, for simplicity the authors only fine-tune the fully-connected layers.
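The contrast with R-CNN in the items above can be made concrete with a short sketch: the convolutional layers run once on the whole image, and every candidate window is then pooled from the shared conv5 feature map with the 4-level pyramid (1+4+9+36 = 50 bins). The sketch reuses the hypothetical `spp_pool` helper from the earlier snippet and assumes window coordinates are already expressed in feature-map cells.

```python
import numpy as np

def pool_windows_from_shared_map(conv5_map, windows, levels=(1, 2, 3, 6)):
    """Pool a fixed-length SPP vector for every candidate window.

    conv5_map: (C, H, W) feature map computed once for the whole image.
    windows:   list of (h0, h1, w0, w1) boxes already in feature-map cells.
    Unlike R-CNN, the convolutional layers are not re-run per window; each
    window is cropped from the shared map and pyramid-pooled with the
    illustrative spp_pool helper sketched earlier.
    """
    return [spp_pool(conv5_map[:, h0:h1, w0:w1], levels) for h0, h1, w0, w1 in windows]

# Example: all ~2,000 proposals of an image would reuse the same conv5_map.
conv5_map = np.random.rand(256, 40, 60)       # conv5 map for one (hypothetical) image
windows = [(0, 13, 0, 13), (5, 30, 10, 55)]   # two proposal boxes, feature-map cells
features = pool_windows_from_shared_map(conv5_map, windows)
assert all(f.shape == (256 * 50,) for f in features)  # 1+4+9+36 = 50 bins per channel
```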
Conclusion
  • The method takes 0.5s for conv and 0.1s for fc (excluding proposals) per testing image on a GPU, extracting convolutional features from all 5 scales.
  • For the 40k testing images, the method requires 8 GPU·hours to compute convolutional features, while R-CNN would require 15 GPU·days.
  • The authors have suggested a solution to train a deep network with a spatial pyramid pooling layer.
  • A window is given in the image domain, but the authors use it to crop the convolutional feature maps.
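The last item refers to mapping a window given in image coordinates onto the conv5 feature map before pooling. A minimal sketch of the corner-rounding rule described in the paper's appendix follows; the total stride of 16 (for ZF-5), the 1-based coordinates, and the helper name are assumptions for illustration.

```python
import math

def project_window_to_conv5(x_min, y_min, x_max, y_max, total_stride=16):
    """Project an image-domain window onto conv5 feature-map coordinates.

    Sketch of the rounding rule from the paper's appendix: the left/top corner
    maps to floor(x / S) + 1 and the right/bottom corner to ceil(x / S) - 1,
    where S is the product of all strides up to conv5 (assumed 16 for ZF-5,
    and assuming floor(p/2) padding so no extra offset is needed).
    """
    fx_min = math.floor(x_min / total_stride) + 1
    fy_min = math.floor(y_min / total_stride) + 1
    fx_max = math.ceil(x_max / total_stride) - 1
    fy_max = math.ceil(y_max / total_stride) - 1
    return fx_min, fy_min, fx_max, fy_max

# Example: a proposal box in image pixels maps to a small conv5 window.
print(project_window_to_conv5(33, 65, 256, 384))  # -> (3, 5, 15, 23)
```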
Tables
  • Table1: Network architectures: filter number×filter size (e.g., 96×7²), filter stride (e.g., str 2), pooling window size (e.g., pool 3²), and the output feature map size (e.g., map size 55 × 55). LRN represents Local Response Normalization
  • Table2: Error rates in the validation set of ImageNet 2012. All the results are obtained using standard 10-view testing. In the brackets are the gains over the “no SPP” baselines
  • Table3: Error rates in the validation set of ImageNet 2012 using a single view. The images are resized so min(w, h) = 256. The crop view is the central 224×224 of the image
  • Table4: Error rates in ImageNet 2012. All the results are based on a single network. The number of views in Overfeat depends on the scales and strides, for which there are several hundreds at the finest scale
  • Table5: The competition results of ILSVRC 2014 classification [26]. The best entry of each team is listed
  • Table6: Classification mAP in Pascal VOC 2007. For SPP-net, the pool5/7 layer uses the 6×6 pyramid level
  • Table7: Classification accuracy in Caltech101. For SPP-net, the pool5/7 layer uses the 6×6 pyramid level
  • Table8: Classification results for Pascal VOC 2007 (mAP) and Caltech101 (accuracy). †numbers reported by [27]. ‡our implementation as in Table 6 (a)
  • Table9: Detection results (mAP) on Pascal VOC 2007
  • Table10: Table 10
  • Table11: Comparisons of detection results on Pascal VOC 2007
  • Table12: Detection results on VOC 2007 using model combination. The results of both models use “ftfc7 bb”
  • Table13: The competition results of ILSVRC 2014 detection (provided-data-only track) [26]. The best entry of each team is listed
Funding
  • We investigate four different network architectures in existing publications [3], [4], [5] (or their modifications), and we show SPP improves the accuracy of all these architectures
  • The top-1 error drops to 29.68%, which is 2.33% better than its no-SPP counterpart and 0.68% better than its single-size trained counterpart
  • We empirically find that (discussed in the next subsection) even for the combination of dozens of views, the additional two full-image views (with flipping) can still boost the accuracy by about 0.2%
  • Our best single network achieves 9.14% top-5 error on the validation set
  • In Table 6 (e) the network architecture is replaced with our best model (Overfeat-7, multi-size trained), and the mAP increases to 82.44%
  • On the SPP (ZF-5) model, the accuracy is 89.91% using the SPP layer as features, lower than the 91.44% obtained by using the same model on the undistorted full image
  • Any negative sample is removed if it overlaps another negative sample by more than 70%
  • After bounding box regression, our 5-scale result (59.2%) is 0.7% better than R-CNN (58.5%), and our 1-scale result (58.0%) is 0.5% worse
  • The Regionlet method improves to 46.1% [8] by combining various features including conv5
  • The second model's mAP is comparable with the first (59.1% vs. 59.2%), and it outperforms the first network in 11 categories
  • Replaced with a 200-category network pre-trained on DET, the mAP significantly drops to 32.7%
  • A 499-category pre-trained network improves the result to 35.9%
  • Training with min(w, h) = 400 instead of 256 further improves the mAP to 37.8%
Study subjects and analysis
teams: 38
The proposed method is 24-102× faster than the R-CNN method, while achieving better or comparable accuracy on Pascal VOC 2007. In the ImageNet Large Scale Visual Recognition Challenge (ILSVRC) 2014, our methods rank #2 in object detection and #3 in image classification among all 38 teams. This manuscript also introduces the improvements made for this competition

teams: 38
A preliminary version of this manuscript has been published in ECCV 2014. Based on this work, we attended the competition of ILSVRC 2014 [26], and ranked #2 in object detection and #3 in image classification (both are provided-data-only tracks) among all 38 teams. There are a few modifications made for ILSVRC 2014

teams: 38
After combining eleven models, our team's result (8.06%) ranked #3 among all 38 teams attending ILSVRC 2014 (Table 5)

Reference
  • Y. LeCun, B. Boser, J. S. Denker, D. Henderson, R. E. Howard, W. Hubbard, and L. D. Jackel, “Backpropagation applied to handwritten zip code recognition,” Neural Computation, 1989.
  • [3] A. Krizhevsky, I. Sutskever, and G. Hinton, “ImageNet classification with deep convolutional neural networks,” in NIPS, 2012.
  • [4] M. D. Zeiler and R. Fergus, “Visualizing and understanding convolutional networks,” arXiv:1311.2901, 2013.
  • [5] P. Sermanet, D. Eigen, X. Zhang, M. Mathieu, R. Fergus, and Y. LeCun, “OverFeat: Integrated recognition, localization and detection using convolutional networks,” arXiv:1312.6229, 2013.
  • [7] R. Girshick, J. Donahue, T. Darrell, and J. Malik, “Rich feature hierarchies for accurate object detection and semantic segmentation,” in CVPR, 2014.
  • [10] Y. Taigman, M. Yang, M. Ranzato, and L. Wolf, “DeepFace: Closing the gap to human-level performance in face verification,” in CVPR, 2014.
  • C. Szegedy et al., “Going deeper with convolutions,” arXiv:1409.4842, 2014.
  • [34] M. Oquab, L. Bottou, I. Laptev, and J. Sivic, “Learning and transferring mid-level image representations using convolutional neural networks,” in CVPR, 2014.
  • [39] X. Wang, M. Yang, S. Zhu, and Y. Lin, “Regionlets for generic object detection,” in ICCV, 2013.
  • [40] C. Szegedy, A. Toshev, and D. Erhan, “Deep neural networks for object detection,” in NIPS, 2013.
  • [13] J. Donahue, Y. Jia, O. Vinyals, J. Hoffman, N. Zhang, E. Tzeng, and T. Darrell, “DeCAF: A deep convolutional activation feature for generic visual recognition,” arXiv:1310.1531, 2013.
  • [15] S. Lazebnik, C. Schmid, and J. Ponce, “Beyond bags of features: Spatial pyramid matching for recognizing natural scene categories,” in CVPR, 2006.
  • [16] J. Sivic and A. Zisserman, “Video Google: A text retrieval approach to object matching in videos,” in ICCV, 2003.
  • [19] F. Perronnin, J. Sanchez, and T. Mensink, “Improving the Fisher kernel for large-scale image classification,” in ECCV, 2010.
  • [21] L. Fei-Fei, R. Fergus, and P. Perona, “Learning generative visual models from few training examples: An incremental Bayesian approach tested on 101 object categories,” CVIU, 2007.
  • [24] N. Dalal and B. Triggs, “Histograms of oriented gradients for human detection,” in CVPR, 2005.
  • [25] C. L. Zitnick and P. Dollar, “Edge boxes: Locating object proposals from edges,” in ECCV, 2014.
  • [27] K. Chatfield, V. Lempitsky, A. Vedaldi, and A. Zisserman, “The devil is in the details: An evaluation of recent feature encoding methods,” in BMVC, 2011.
  • [28] A. Coates and A. Ng, “The importance of encoding versus training with sparse coding and vector quantization,” in ICML, 2011.
  • [29] D. G. Lowe, “Distinctive image features from scale-invariant keypoints,” IJCV, 2004.