Spatial Pyramid Pooling in Deep Convolutional Networks for Visual Recognition

IEEE Transactions on Pattern Analysis and Machine Intelligence, Volume PP, Issue 99, 2015, Pages 1

Cited by: 4435 | Views: 1348

Abstract:

Existing deep convolutional neural networks (CNNs) require a fixed-size (e.g., 224×224) input image. This requirement is “artificial” and may reduce the recognition accuracy for the images or sub-images of an arbitrary size/scale. In this work, we equip the networks with another pooling strategy, “spatial pyramid pooling”, to eliminate the…

Introduction
  • The authors are witnessing a rapid, revolutionary change in the vision community, mainly caused by deep convolutional neural networks (CNNs) [1] and the availability of large scale training data [2].
  • When applied to images of arbitrary sizes, current methods mostly fit the input image to the fixed size, either via cropping [3], [4] or via warping [13], [7], as shown in Figure 1.
  • The cropped region may not contain the entire object, while the warped content may result in unwanted geometric distortion (a minimal sketch of both options follows this list).
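The fixed-size fitting mentioned above is typically done in one of two ways. Below is a minimal sketch (not from the paper; the helper names and the use of Pillow are illustrative assumptions) of warping versus center-cropping an arbitrary image to a 224×224 network input:

    from PIL import Image

    TARGET = 224  # fixed input size expected by the CNN (e.g., 224x224)

    def warp_to_fixed(img: Image.Image) -> Image.Image:
        # Warping: anisotropic resize keeps all content but distorts geometry.
        return img.resize((TARGET, TARGET), Image.BILINEAR)

    def crop_to_fixed(img: Image.Image) -> Image.Image:
        # Cropping: scale the short side to TARGET, then take the central square;
        # the aspect ratio is preserved but parts of the object may be cut off.
        w, h = img.size
        scale = TARGET / min(w, h)
        img = img.resize((round(w * scale), round(h * scale)), Image.BILINEAR)
        w, h = img.size
        left, top = (w - TARGET) // 2, (h - TARGET) // 2
        return img.crop((left, top, left + TARGET, top + TARGET))

Either choice imposes the trade-off the bullets describe: lost content for the crop, geometric distortion for the warp.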
Highlights
  • We are witnessing a rapid, revolutionary change in our vision community, mainly caused by deep convolutional neural networks (CNNs) [1] and the availability of large scale training data [2]
  • We introduce a spatial pyramid pooling (SPP) [14], [15] layer to remove the fixed-size constraint of the network (a minimal pooling sketch follows this list)
  • Since the advantages of SPPnet should be in general independent of architectures, we expect that it will further improve the deeper and larger convolutional architectures [33], [32]
  • Spatial pyramid pooling is a flexible solution for handling different scales, sizes, and aspect ratios
  • These issues are important in visual recognition, but received little consideration in the context of deep networks
  • Our 5-scale result (59.2%) is 0.7% better than R-CNN (58.5%), and our 1-scale result (58.0%) is 0.5% worse
  • We have suggested a solution to train a deep network with a spatial pyramid pooling layer
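The pooling idea highlighted above can be made concrete with a short sketch. The following is a minimal NumPy illustration (an assumption-level sketch, not the authors' implementation) of spatial pyramid pooling: a conv feature map of any spatial size is max-pooled into a fixed number of bins per pyramid level, so the output length depends only on the channel count and the pyramid configuration (e.g., the 4-level {6×6, 3×3, 2×2, 1×1} pyramid used in the paper gives 50 bins):

    import numpy as np

    def spatial_pyramid_pool(feature_map, levels=(1, 2, 3, 6)):
        # Max-pool a conv feature map (C, H, W) into a fixed-length vector.
        # For each level n the map is divided into roughly n x n bins and each
        # bin is max-pooled per channel, so the output length is
        # C * sum(n * n for n in levels) regardless of H and W.
        c, h, w = feature_map.shape
        pooled = []
        for n in levels:
            for i in range(n):
                for j in range(n):
                    # Bin boundaries via floor/ceil so the bins always tile the map
                    # (the paper uses a similar ceil/floor window-and-stride rule).
                    y0, y1 = int(np.floor(i * h / n)), int(np.ceil((i + 1) * h / n))
                    x0, x1 = int(np.floor(j * w / n)), int(np.ceil((j + 1) * w / n))
                    region = feature_map[:, y0:y1, x0:x1]
                    pooled.append(region.reshape(c, -1).max(axis=1))
        return np.concatenate(pooled)  # shape: (C * sum(n*n for n in levels),)

    # Two different input sizes yield the same output length:
    for shape in [(256, 13, 13), (256, 10, 18)]:
        vec = spatial_pyramid_pool(np.random.rand(*shape))
        print(vec.shape)  # (12800,) both times: 256 * (1 + 4 + 9 + 36)

Because the output length is fixed, the pooled vector can feed the fully-connected layers directly, which is what removes the fixed-size input constraint.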
Methods
  • Experiments on ImageNet 2012 Classification: the authors train the networks on the 1000-category training set of ImageNet 2012.
  • The fully-connected layers are less accurate, and the SPP layers are better
  • This is possibly because the object categories in Caltech101 are less related to those in ImageNet, and the deeper layers are more category-specialized.
  • The authors find that the scale 224 has the best performance among the scales tested on this dataset
  • This is mainly because the objects in Caltech101 occupy large regions of the images, as is the case of ImageNet
Results
  • Summary and Results for ILSVRC 2014: in Table 4 the authors compare with previous state-of-the-art methods.
  • The authors' best single network achieves 9.14% top-5 error on the validation set.
  • This is exactly the single-model entry the authors submitted to ILSVRC 2014 [26].
  • The authors' team’s result (8.06%) is ranked #3 among all 38 teams attending ILSVRC 2014 (Table 5).
  • The feature map regions can have strong activations near the window boundaries, while the image regions may not
  • This difference in usage can be addressed by fine-tuning (a sketch of window-wise pooling on the conv feature maps follows this list).
  • Our 5-scale result (59.2%) is 0.7% better than R-CNN (58.5%), and our 1-scale result (58.0%) is 0.5% worse
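For detection, the bullets above refer to pooling features for candidate windows directly on the conv feature maps of the full image rather than from cropped image regions. Below is a minimal sketch of that step (it reuses the spatial_pyramid_pool helper from the earlier sketch; the corner rounding rule and the stride value are simplifying assumptions, with 16 corresponding to the conv5 subsampling of a ZF-5-like model):

    import numpy as np

    def pool_window_features(conv_maps, window, stride=16, levels=(1, 2, 3, 6)):
        # conv_maps : (C, H, W) conv feature maps computed once for the full image.
        # window    : (x0, y0, x1, y1) candidate window in image pixel coordinates.
        # stride    : overall subsampling factor from image to feature map (assumed 16).
        x0, y0, x1, y1 = window
        # Project the image window onto feature-map coordinates (simple floor/ceil;
        # the paper uses a slightly offset rule for the two corners).
        fx0, fy0 = int(np.floor(x0 / stride)), int(np.floor(y0 / stride))
        fx1 = max(fx0 + 1, int(np.ceil(x1 / stride)))
        fy1 = max(fy0 + 1, int(np.ceil(y1 / stride)))
        region = conv_maps[:, fy0:fy1, fx0:fx1]
        return spatial_pyramid_pool(region, levels)  # fixed-length feature per window

Since the conv layers run only once per image and every window is pooled from the shared maps, per-window cost is small; the boundary-activation mismatch noted above is then reduced by fine-tuning.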
Conclusion
  • SPP is a flexible solution for handling different scales, sizes, and aspect ratios.
  • These issues are important in visual recognition, but received little consideration in the context of deep networks.
  • The authors have suggested a solution to train a deep network with a spatial pyramid pooling layer.
  • The authors' studies show that many time-proven techniques/insights in computer vision can still play important roles in deep-networks-based recognition
Tables
  • Table1: Network architectures: filter number × filter size (e.g., 96 × 7²), filter stride (e.g., str 2), pooling window size (e.g., pool 3²), and the output feature map size (e.g., map size 55 × 55). LRN represents Local Response Normalization
  • Table2: Error rates in the validation set of ImageNet 2012. All the results are obtained using standard 10-view testing (sketched after this table list). In the brackets are the gains over the “no SPP” baselines
  • Table3: Error rates in the validation set of ImageNet 2012 using a single view. The images are resized so min(w, h) = 256. The crop view is the central 224×224 of the image
  • Table4: Error rates in ImageNet 2012. All the results are based on a single network. The number of views in Overfeat depends on the scales and strides, for which there are several hundreds at the finest scale
  • Table5: The competition results of ILSVRC 2014 classification [26]. The best entry of each team is listed
  • Table6: Classification mAP in Pascal VOC 2007. For SPP-net, the pool5/7 layer uses the 6×6 pyramid level
  • Table7: Classification accuracy in Caltech101. For SPP-net, the pool5/7 layer uses the 6×6 pyramid level
  • Table8: Classification results for Pascal VOC 2007 (mAP) and Caltech101 (accuracy). †numbers reported by [27]. ‡our implementation as in Table 6 (a)
  • Table9: Detection results (mAP) on Pascal VOC 2007. “ft” and “bb” denote fine-tuning and bounding box regression
  • Table10: Detection results (mAP) on Pascal VOC 2007, using the same pre-trained model of SPP (ZF-5)
  • Table11: Comparisons of detection results on Pascal VOC 2007
  • Table12: Detection results on VOC 2007 using model combination. The results of both models use “ftfc7 bb”. SPP-net (2) denotes the second network; its mAP is comparable with the first network (59.1% vs. 59.2%), and it outperforms the first network in 11 categories
  • Table13: The competition results of ILSVRC 2014 detection (provided-data-only track) [26]. The best entry of each team is listed
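Several table captions above refer to “standard 10-view testing”. A minimal sketch of that protocol (center and four corner 224×224 crops of a resized image plus their horizontal flips; the array layout is an assumption):

    import numpy as np

    def ten_view_crops(img, crop=224):
        # img: (H, W, C) array with H, W >= crop (e.g., shorter side resized to 256).
        h, w, _ = img.shape
        offsets = [(0, 0), (0, w - crop), (h - crop, 0), (h - crop, w - crop),
                   ((h - crop) // 2, (w - crop) // 2)]
        views = [img[y:y + crop, x:x + crop] for y, x in offsets]
        views += [v[:, ::-1] for v in views]  # horizontal flips
        return np.stack(views)                # (10, crop, crop, C)

At test time the network's predictions over the 10 views are averaged to produce the reported error rates.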
Reference
  • [1] Y. LeCun, B. Boser, J. S. Denker, D. Henderson, R. E. Howard, W. Hubbard, and L. D. Jackel, “Backpropagation applied to handwritten zip code recognition,” Neural Computation, 1989.
  • [2] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei, “ImageNet: A large-scale hierarchical image database,” in CVPR, 2009.
  • [3] A. Krizhevsky, I. Sutskever, and G. Hinton, “ImageNet classification with deep convolutional neural networks,” in NIPS, 2012.
  • [4] M. D. Zeiler and R. Fergus, “Visualizing and understanding convolutional neural networks,” arXiv:1311.2901, 2013.
  • [5] P. Sermanet, D. Eigen, X. Zhang, M. Mathieu, R. Fergus, and Y. LeCun, “OverFeat: Integrated recognition, localization and detection using convolutional networks,” arXiv:1312.6229, 2013.
  • [6] K. Chatfield, K. Simonyan, A. Vedaldi, and A. Zisserman, “Return of the devil in the details: Delving deep into convolutional nets,” arXiv:1405.3531, 2014.
  • [7] R. Girshick, J. Donahue, T. Darrell, and J. Malik, “Rich feature hierarchies for accurate object detection and semantic segmentation,” in CVPR, 2014.
  • [8] W. Y. Zou, X. Wang, M. Sun, and Y. Lin, “Generic object detection with dense neural patterns and regionlets,” arXiv:1404.4316, 2014.
  • [9] A. S. Razavian, H. Azizpour, J. Sullivan, and S. Carlsson, “CNN features off-the-shelf: An astounding baseline for recognition,” in CVPR 2014, DeepVision Workshop, 2014.
  • [10] Y. Taigman, M. Yang, M. Ranzato, and L. Wolf, “DeepFace: Closing the gap to human-level performance in face verification,” in CVPR, 2014.
  • [11] N. Zhang, M. Paluri, M. Ranzato, T. Darrell, and L. Bourdev, “PANDA: Pose aligned networks for deep attribute modeling,” in CVPR, 2014.
  • [12] Y. Gong, L. Wang, R. Guo, and S. Lazebnik, “Multi-scale orderless pooling of deep convolutional activation features,” arXiv:1403.1840, 2014.
  • [13] J. Donahue, Y. Jia, O. Vinyals, J. Hoffman, N. Zhang, E. Tzeng, and T. Darrell, “DeCAF: A deep convolutional activation feature for generic visual recognition,” arXiv:1310.1531, 2013.
  • [14] K. Grauman and T. Darrell, “The pyramid match kernel: Discriminative classification with sets of image features,” in ICCV, 2005.
  • [15] S. Lazebnik, C. Schmid, and J. Ponce, “Beyond bags of features: Spatial pyramid matching for recognizing natural scene categories,” in CVPR, 2006.
  • [16] J. Sivic and A. Zisserman, “Video Google: A text retrieval approach to object matching in videos,” in ICCV, 2003.
  • [17] J. Yang, K. Yu, Y. Gong, and T. Huang, “Linear spatial pyramid matching using sparse coding for image classification,” in CVPR, 2009.
  • [18] J. Wang, J. Yang, K. Yu, F. Lv, T. Huang, and Y. Gong, “Locality-constrained linear coding for image classification,” in CVPR, 2010.
  • [19] F. Perronnin, J. Sanchez, and T. Mensink, “Improving the Fisher kernel for large-scale image classification,” in ECCV, 2010.
  • [20] K. E. van de Sande, J. R. Uijlings, T. Gevers, and A. W. Smeulders, “Segmentation as selective search for object recognition,” in ICCV, 2011.
  • [21] L. Fei-Fei, R. Fergus, and P. Perona, “Learning generative visual models from few training examples: An incremental Bayesian approach tested on 101 object categories,” CVIU, 2007.
  • [22] M. Everingham, L. Van Gool, C. K. I. Williams, J. Winn, and A. Zisserman, “The PASCAL Visual Object Classes Challenge 2007 (VOC2007) Results,” 2007.
  • [23] P. F. Felzenszwalb, R. B. Girshick, D. McAllester, and D. Ramanan, “Object detection with discriminatively trained part-based models,” PAMI, 2010.
  • [24] N. Dalal and B. Triggs, “Histograms of oriented gradients for human detection,” in CVPR, 2005.
  • [25] C. L. Zitnick and P. Dollar, “Edge boxes: Locating object proposals from edges,” in ECCV, 2014.
  • [26] O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein et al., “ImageNet large scale visual recognition challenge,” arXiv:1409.0575, 2014.
  • [27] K. Chatfield, V. Lempitsky, A. Vedaldi, and A. Zisserman, “The devil is in the details: An evaluation of recent feature encoding methods,” in BMVC, 2011.
  • [28] A. Coates and A. Ng, “The importance of encoding versus training with sparse coding and vector quantization,” in ICML, 2011.
  • [29] D. G. Lowe, “Distinctive image features from scale-invariant keypoints,” IJCV, 2004.
  • [30] J. C. van Gemert, J.-M. Geusebroek, C. J. Veenman, and A. W. Smeulders, “Kernel codebooks for scene categorization,” in ECCV, 2008.
  • [31] M. Lin, Q. Chen, and S. Yan, “Network in network,” arXiv:1312.4400, 2013.
  • [32] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and A. Rabinovich, “Going deeper with convolutions,” arXiv:1409.4842, 2014.
  • [33] K. Simonyan and A. Zisserman, “Very deep convolutional networks for large-scale image recognition,” arXiv:1409.1556, 2014.
  • [34] M. Oquab, L. Bottou, I. Laptev, J. Sivic et al., “Learning and transferring mid-level image representations using convolutional neural networks,” in CVPR, 2014.
  • [35] Y. Jia, “Caffe: An open source convolutional architecture for fast feature embedding,” http://caffe.berkeleyvision.org/, 2013.
  • [36] A. G. Howard, “Some improvements on deep convolutional neural network based image classification,” arXiv:1312.5402, 2013.
  • [37] H. Jegou, F. Perronnin, M. Douze, J. Sanchez, P. Perez, and C. Schmid, “Aggregating local image descriptors into compact codes,” TPAMI, vol. 34, no. 9, pp. 1704-1716, 2012.
  • [38] C.-C. Chang and C.-J. Lin, “LIBSVM: A library for support vector machines,” ACM Transactions on Intelligent Systems and Technology (TIST), 2011.
  • [39] X. Wang, M. Yang, S. Zhu, and Y. Lin, “Regionlets for generic object detection,” in ICCV, 2013.
  • [40] C. Szegedy, A. Toshev, and D. Erhan, “Deep neural networks for object detection,” in NIPS, 2013.