Focal Loss for Dense Object Detection

IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI), vol. 42, no. 2, pp. 318-327, 2020. (Conference version: ICCV 2017.)


Abstract:

The highest accuracy object detectors to date are based on a two-stage approach popularized by R-CNN, where a classifier is applied to a sparse set of candidate object locations. In contrast, one-stage detectors that are applied over a regular, dense sampling of possible object locations have the potential to be faster and simpler, but have trailed the accuracy of two-stage detectors thus far. In this paper, we investigate why this is the case. We discover that the extreme foreground-background class imbalance encountered during training of dense detectors is the central cause. We propose to address this class imbalance by reshaping the standard cross entropy loss such that it down-weights the loss assigned to well-classified examples. Our novel focal loss focuses training on a sparse set of hard examples and prevents the vast number of easy negatives from overwhelming the detector during training. To evaluate the effectiveness of our loss, we design and train a simple dense detector we call RetinaNet. Our results show that when trained with the focal loss, RetinaNet is able to match the speed of previous one-stage detectors while surpassing the accuracy of all existing state-of-the-art two-stage detectors.

Introduction
  • Current state-of-the-art object detectors are based on a two-stage, proposal-driven mechanism.
  • This paper pushes the envelope further: the authors present a one-stage object detector that, for the first time, matches the state-of-the-art COCO AP of more complex two-stage detectors, such as the Feature Pyramid Network (FPN) [19] or Mask R-CNN [13] variants of Faster R-CNN [27].
  • The authors identify class imbalance during training as the main obstacle impeding one-stage detectors from achieving state-of-the-art accuracy and propose a new loss function that eliminates this barrier.
Highlights
  • Current state-of-the-art object detectors are based on a two-stage, proposal-driven mechanism
  • As popularized in the R-CNN framework [11], the first stage generates a sparse set of candidate object locations and the second stage classifies each candidate location as one of the foreground classes or as background using a convolutional neural network
  • We show that our proposed focal loss naturally handles the class imbalance faced by a one-stage detector and allows us to efficiently train on all examples without sampling and without easy negatives overwhelming the loss and computed gradients
  • We identify class imbalance as the primary obstacle preventing one-stage object detectors from surpassing top-performing, two-stage methods, such as Faster R-CNN variants
  • We propose the focal loss, which applies a modulating term to the cross entropy loss in order to focus learning on hard examples and down-weight the numerous easy negatives (see the worked example after this list)
  • We demonstrate its efficacy by designing a fully convolutional one-stage detector and report extensive experimental analysis showing that it achieves state-of-the-art accuracy and run time on the challenging COCO dataset
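A quick worked example of the modulating term (our arithmetic, using the paper's default γ = 2): an easy, well-classified example with pt = 0.9 has its cross entropy loss scaled by (1 − 0.9)² = 0.01, a 100× down-weighting, while a hard example with pt = 0.1 is scaled by (1 − 0.1)² = 0.81 and contributes nearly its full loss.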
Methods
  • The authors present experimental results on the bounding box detection track of the challenging COCO benchmark [20].
  • The authors follow common practice [1, 19] and train on the COCO trainval35k split (the 80k train images plus a random 35k-image subset of the 40k-image val split).
  • The authors report lesion and sensitivity studies by evaluating on the minival split.
  • The authors report COCO AP on the test-dev split, which has no public labels and requires use of the evaluation server (see the evaluation sketch after this list).
  • For all ablation studies, the authors use an image scale of 600 pixels for training and testing.
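For concreteness, the sketch below shows how COCO bounding-box AP is typically computed with the standard pycocotools evaluation API. This is a minimal illustration under our own assumptions, not the authors' code, and both JSON file paths are placeholders.

    from pycocotools.coco import COCO
    from pycocotools.cocoeval import COCOeval

    # Placeholder paths: ground-truth annotations and detector output in COCO JSON format.
    coco_gt = COCO("annotations/instances_minival.json")
    coco_dt = coco_gt.loadRes("retinanet_bbox_results.json")

    # "bbox" selects the bounding box detection track used throughout the paper.
    coco_eval = COCOeval(coco_gt, coco_dt, iouType="bbox")
    coco_eval.evaluate()
    coco_eval.accumulate()
    coco_eval.summarize()  # prints AP, AP50, AP75, and the scale-specific AP metrics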
Results
  • SSD has a 10-20% lower AP than RetinaNet, while YOLO focuses on an even more extreme speed/accuracy trade-off.
  • In practice, the authors use an α-balanced variant of the focal loss: FL(pt) = −αt(1 − pt)γ log(pt).
  • The authors adopt this form in the experiments as it yields slightly improved accuracy over the non-α-balanced form (a runnable sketch of this loss follows this list)
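The sketch below is a minimal NumPy implementation of this α-balanced focal loss for binary (foreground vs. background) classification. It follows the formula above, but the function name and the example probabilities are our own.

    import numpy as np

    def focal_loss(p, y, alpha=0.25, gamma=2.0):
        # p: predicted probability of the foreground class, in (0, 1).
        # y: ground-truth label, 1 = foreground, 0 = background.
        # With gamma = 0 this reduces to alpha-balanced cross entropy.
        p = np.asarray(p, dtype=np.float64)
        p_t = np.where(y == 1, p, 1.0 - p)              # pt as defined in the paper
        alpha_t = np.where(y == 1, alpha, 1.0 - alpha)  # alpha_t defined analogously
        return -alpha_t * (1.0 - p_t) ** gamma * np.log(p_t)

    # An easy negative (p = 0.1, so pt = 0.9) is down-weighted ~100x versus cross entropy:
    print(focal_loss(0.1, 0, gamma=2.0))  # ~0.0008
    print(focal_loss(0.1, 0, gamma=0.0))  # ~0.079 (plain alpha-balanced cross entropy)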
Conclusion
  • The authors identify class imbalance as the primary obstacle preventing one-stage object detectors from surpassing top-performing, two-stage methods, such as Faster R-CNN variants.
  • The authors propose the focal loss, which applies a modulating term to the cross entropy loss in order to focus learning on hard examples and down-weight the numerous easy negatives.
  • The authors' approach is simple and highly effective.
  • The authors demonstrate its efficacy by designing a fully convolutional one-stage detector and report extensive experimental analysis showing that it achieves state-of-the-art accuracy and run time on the challenging COCO dataset.
Tables
  • Table 1: Ablation experiments for RetinaNet and Focal Loss (FL). All models are trained on trainval35k and tested on minival unless noted. If not specified, default values are: γ = 2; anchors for 3 scales and 3 aspect ratios; ResNet-50-FPN backbone; and a 600 pixel train and test image scale. (a) RetinaNet with α-balanced CE achieves at most 31.1 AP. (b) In contrast, using FL with the same exact network gives a 2.9 AP gain and is fairly robust to exact γ/α settings. (c) Using 2-3 scale and 3 aspect ratio anchors yields good results, after which point performance saturates (see the anchor sketch after this list). (d) FL outperforms the best variants of online hard example mining (OHEM) [30, 21] by over 3 points AP. (e) Accuracy/speed trade-off of RetinaNet on test-dev for various network depths and image scales (see also Figure 2).
  • Table 2: Object detection single-model results (bounding box AP) vs. state-of-the-art on COCO test-dev. We show results for our RetinaNet-101-800 model, trained with scale jitter and for 1.5× longer than the same model from Table 1e. Our model achieves top results, outperforming both one-stage and two-stage models. For a detailed breakdown of speed versus accuracy see Table 1e and Figure 2.
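As a side note on the anchor default in Table 1(c): the paper uses three octave scales {2^0, 2^(1/3), 2^(2/3)} and three aspect ratios {1:2, 1:1, 2:1} per pyramid level. The sketch below enumerates one level's nine anchor shapes; the 32-pixel base size is illustrative only (RetinaNet uses a different base size per FPN level).

    import itertools

    base_size = 32.0                             # illustrative base anchor size in pixels
    scales = [2 ** (i / 3.0) for i in range(3)]  # octave sub-scales 2^0, 2^(1/3), 2^(2/3)
    aspect_ratios = [0.5, 1.0, 2.0]              # height/width ratios for 1:2, 1:1, 2:1

    anchors = []
    for scale, ratio in itertools.product(scales, aspect_ratios):
        area = (base_size * scale) ** 2          # changing aspect ratio preserves area
        w = (area / ratio) ** 0.5
        h = w * ratio                            # so that h / w == ratio
        anchors.append((round(w, 1), round(h, 1)))

    print(anchors)                               # 9 (width, height) pairs for one level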
Related work
  • Classic Object Detectors: The sliding-window paradigm, in which a classifier is applied on a dense image grid, has a long and rich history. One of the earliest successes is the classic work of LeCun et al., who applied convolutional neural networks to handwritten digit recognition [18, 35]. Viola and Jones [36] used boosted object detectors for face detection, leading to widespread adoption of such models. The introduction of HOG [4] and integral channel features [5] gave rise to effective methods for pedestrian detection. DPMs [8] helped extend dense detectors to more general object categories and had top results on PASCAL [7] for many years. While the sliding-window approach was the leading detection paradigm in classic computer vision, with the resurgence of deep learning [17], two-stage detectors, described next, quickly came to dominate object detection.
References
  • [1] S. Bell, C. L. Zitnick, K. Bala, and R. Girshick. Inside-outside net: Detecting objects in context with skip pooling and recurrent neural networks. In CVPR, 2016.
  • [2] S. R. Bulo, G. Neuhold, and P. Kontschieder. Loss max-pooling for semantic image segmentation. In CVPR, 2017.
  • [3] J. Dai, Y. Li, K. He, and J. Sun. R-FCN: Object detection via region-based fully convolutional networks. In NIPS, 2016.
  • [4] N. Dalal and B. Triggs. Histograms of oriented gradients for human detection. In CVPR, 2005.
  • [5] P. Dollár, Z. Tu, P. Perona, and S. Belongie. Integral channel features. In BMVC, 2009.
  • [6] D. Erhan, C. Szegedy, A. Toshev, and D. Anguelov. Scalable object detection using deep neural networks. In CVPR, 2014.
  • [7] M. Everingham, L. Van Gool, C. K. Williams, J. Winn, and A. Zisserman. The PASCAL Visual Object Classes (VOC) Challenge. IJCV, 2010.
  • [8] P. F. Felzenszwalb, R. B. Girshick, and D. McAllester. Cascade object detection with deformable part models. In CVPR, 2010.
  • [9] C.-Y. Fu, W. Liu, A. Ranga, A. Tyagi, and A. C. Berg. DSSD: Deconvolutional single shot detector. arXiv:1701.06659, 2016.
  • [10] R. Girshick. Fast R-CNN. In ICCV, 2015.
  • [11] R. Girshick, J. Donahue, T. Darrell, and J. Malik. Rich feature hierarchies for accurate object detection and semantic segmentation. In CVPR, 2014.
  • [12] T. Hastie, R. Tibshirani, and J. Friedman. The Elements of Statistical Learning. Springer Series in Statistics. Springer, Berlin, 2008.
  • [13] K. He, G. Gkioxari, P. Dollár, and R. Girshick. Mask R-CNN. In ICCV, 2017.
  • [14] K. He, X. Zhang, S. Ren, and J. Sun. Spatial pyramid pooling in deep convolutional networks for visual recognition. In ECCV, 2014.
  • [15] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In CVPR, 2016.
  • [16] J. Huang, V. Rathod, C. Sun, M. Zhu, A. Korattikara, A. Fathi, I. Fischer, Z. Wojna, Y. Song, S. Guadarrama, and K. Murphy. Speed/accuracy trade-offs for modern convolutional object detectors. In CVPR, 2017.
  • [17] A. Krizhevsky, I. Sutskever, and G. Hinton. ImageNet classification with deep convolutional neural networks. In NIPS, 2012.
  • [18] Y. LeCun, B. Boser, J. S. Denker, D. Henderson, R. E. Howard, W. Hubbard, and L. D. Jackel. Backpropagation applied to handwritten zip code recognition. Neural Computation, 1989.
  • [19] T.-Y. Lin, P. Dollár, R. Girshick, K. He, B. Hariharan, and S. Belongie. Feature pyramid networks for object detection. In CVPR, 2017.
  • [20] T.-Y. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár, and C. L. Zitnick. Microsoft COCO: Common objects in context. In ECCV, 2014.
  • [21] W. Liu, D. Anguelov, D. Erhan, C. Szegedy, and S. Reed. SSD: Single shot multibox detector. In ECCV, 2016.
  • [22] J. Long, E. Shelhamer, and T. Darrell. Fully convolutional networks for semantic segmentation. In CVPR, 2015.
  • [23] P. O. Pinheiro, R. Collobert, and P. Dollár. Learning to segment object candidates. In NIPS, 2015.
  • [24] P. O. Pinheiro, T.-Y. Lin, R. Collobert, and P. Dollár. Learning to refine object segments. In ECCV, 2016.
  • [25] J. Redmon, S. Divvala, R. Girshick, and A. Farhadi. You only look once: Unified, real-time object detection. In CVPR, 2016.
  • [26] J. Redmon and A. Farhadi. YOLO9000: Better, faster, stronger. In CVPR, 2017.
  • [27] S. Ren, K. He, R. Girshick, and J. Sun. Faster R-CNN: Towards real-time object detection with region proposal networks. In NIPS, 2015.
  • [28] H. Rowley, S. Baluja, and T. Kanade. Human face detection in visual scenes. Technical Report CMU-CS-95-158R, Carnegie Mellon University, 1995.
  • [29] P. Sermanet, D. Eigen, X. Zhang, M. Mathieu, R. Fergus, and Y. LeCun. OverFeat: Integrated recognition, localization and detection using convolutional networks. In ICLR, 2014.
  • [30] A. Shrivastava, A. Gupta, and R. Girshick. Training region-based object detectors with online hard example mining. In CVPR, 2016.
  • [31] A. Shrivastava, R. Sukthankar, J. Malik, and A. Gupta. Beyond skip connections: Top-down modulation for object detection. arXiv:1612.06851, 2016.
  • [32] K.-K. Sung and T. Poggio. Learning and example selection for object and pattern detection. MIT A.I. Memo No. 1521, 1994.
  • [33] C. Szegedy, S. Ioffe, and V. Vanhoucke. Inception-v4, Inception-ResNet and the impact of residual connections on learning. arXiv:1602.07261, 2016.
  • [34] J. R. Uijlings, K. E. van de Sande, T. Gevers, and A. W. Smeulders. Selective search for object recognition. IJCV, 2013.
  • [35] R. Vaillant, C. Monrocq, and Y. LeCun. Original approach for the localisation of objects in images. IEE Proc. on Vision, Image, and Signal Processing, 1994.
  • [36] P. Viola and M. Jones. Rapid object detection using a boosted cascade of simple features. In CVPR, 2001.
  • [37] C. L. Zitnick and P. Dollár. Edge boxes: Locating object proposals from edges. In ECCV, 2014.