MetaAnchor: Learning to Detect Objects with Customized Anchors

Xiangyu Zhang
Zeming Li
Wenqiang Zhang
Jian Sun

ADVANCES IN NEURAL INFORMATION PROCESSING SYSTEMS 31 (NIPS 2018), pp. 320-330, 2018.

Keywords: predefined anchor, single shot, box distribution, transfer task, detection system

Abstract:

We propose a novel and flexible anchor mechanism named MetaAnchor for object detection frameworks. Unlike many previous detectors, which model anchors in a predefined manner, MetaAnchor generates anchor functions dynamically from arbitrary customized prior boxes. Taking advantage of weight prediction, MetaAnchor is able to work ...
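The core mechanism described above can be pictured with a small sketch. This is a hypothetical, pure-Python illustration, not the authors' implementation: the names `anchor_embedding` and `make_generator`, the log-based box encoding, and the tiny MLP dimensions are all assumptions standing in for the paper's weight-prediction module [10]. The point is only the data flow: a generator G maps a customized prior box b = (w, h) to the parameters θ_b of the corresponding anchor function.

```python
import math
import random

random.seed(0)

def anchor_embedding(w, h, ref_w=100.0, ref_h=100.0):
    """Encode a customized prior box b = (w, h) relative to a reference box.
    The log-scaled encoding here is an assumption for illustration."""
    return [math.log(w / ref_w), math.log(h / ref_h)]

def make_generator(in_dim=2, hidden=8, out_dim=4):
    """A tiny randomly initialized two-layer MLP G: b -> theta_b,
    standing in for a learned weight-prediction module."""
    W1 = [[random.gauss(0.0, 0.1) for _ in range(in_dim)] for _ in range(hidden)]
    W2 = [[random.gauss(0.0, 0.1) for _ in range(hidden)] for _ in range(out_dim)]
    def G(b):
        # ReLU hidden layer, then linear output producing the weight vector.
        hid = [max(0.0, sum(w * x for w, x in zip(row, b))) for row in W1]
        return [sum(w * x for w, x in zip(row, hid)) for row in W2]
    return G

G = make_generator()
# Two different customized prior boxes yield two anchor-function weight vectors:
theta_small = G(anchor_embedding(32, 32))
theta_large = G(anchor_embedding(128, 256))
print(len(theta_small), len(theta_large))  # → 4 4
```

Because θ_b is produced at call time, anchors need not be fixed at training time; any box shape can be fed in at inference.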

Introduction
  • The last few years have seen the success of deep neural networks in object detection tasks [5, 39, 9, 12, 8, 32, 16, 2].
  • Object detection often requires generating a set of bounding boxes, along with their classification labels, for each object in the given image.
  • It is nontrivial for convolutional neural networks (CNNs) to directly predict an orderless set of arbitrary cardinality.
Highlights
  • The last few years have seen the success of deep neural networks in object detection tasks [5, 39, 9, 12, 8, 32, 16, 2]
  • We find that all the baseline models suffer significant drops, especially on AP50, which implies degradation of the anchor functions; increasing the number of anchors does little to improve performance
  • We propose a novel and flexible anchor mechanism named MetaAnchor for object detection frameworks, in which anchor functions can be dynamically generated from arbitrary customized prior boxes
  • Compared with the predefined anchor scheme, we empirically find that MetaAnchor is more robust to anchor settings and bounding box distributions; in addition, it shows potential on transfer tasks
  • Our experiment on the COCO detection task shows that MetaAnchor consistently outperforms its counterparts in various scenarios
Methods
  • Table fragment (columns: Baseline, MetaAnchor, Search; metric mAP@0.5, %): 82.5

    4.1.3 Cross evaluation between datasets of different distributions

    Though domain adaptation or transfer learning [29] is beyond the design purpose of MetaAnchor, the technique of weight prediction [10] employed in this paper has recently been applied successfully to those tasks [15, 14].
  • What about the performance if the detection model is trained on another dataset which has the same class labels but a different distribution of object box sizes?
  • After some ground-truth boxes are erased, all the scores drop significantly; compared with the RetinaNet baseline, MetaAnchor suffers smaller degradations and generates much better predictions, which shows its potential on transfer tasks
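A toy illustration of why this cross evaluation is hard for predefined anchors (all numbers and thresholds invented, and this is not the paper's protocol): with a fixed anchor set, the fraction of ground-truth box shapes matched at IoU ≥ 0.5 collapses when the test box-size distribution shifts, whereas MetaAnchor can instead feed test-time customized boxes to its generator.

```python
def wh_iou(a, b):
    """IoU of two boxes given as (w, h), imagined to share the same center."""
    inter = min(a[0], b[0]) * min(a[1], b[1])
    union = a[0] * a[1] + b[0] * b[1] - inter
    return inter / union

def coverage(anchors, gt_boxes, thresh=0.5):
    """Fraction of ground-truth shapes matched by some anchor at IoU >= thresh."""
    hit = sum(1 for g in gt_boxes if max(wh_iou(a, g) for a in anchors) >= thresh)
    return hit / len(gt_boxes)

anchors = [(32, 32), (64, 64), (128, 128)]       # a fixed, predefined anchor set
in_dist = [(30, 34), (70, 60), (120, 130)]       # box sizes like those seen in training
shifted = [(300, 340), (500, 260), (420, 480)]   # a dataset with much larger objects

print(coverage(anchors, in_dist), coverage(anchors, shifted))  # → 1.0 0.0
```

The in-distribution boxes are all covered, while none of the shifted boxes are; the fixed anchors simply have no shape near the new distribution.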
Results
  • Results on COCO Object Detection

    The authors compare the fully-equipped MetaAnchor models with the RetinaNet [23] baselines on the COCO-full dataset.
  • The authors further investigate many anchor box configurations and retrain the baseline model; the best of these is named “RetinaNet∗” and marked with “search” in Table 8.
  • The authors' MetaAnchor model achieves 37.5% mmAP on COCO minival, which is 1.7% better than the original RetinaNet and 0.6% better than the best searched entry of RetinaNet. The authors' data-dependent variant (Equ. 6) further boosts the performance by 0.4%.
  • It is clear that the shapes of detected boxes vary according to the customized anchor box b_i
Conclusion
  • The authors propose a novel and flexible anchor mechanism named MetaAnchor for object detection frameworks, in which anchor functions can be dynamically generated from arbitrary customized prior boxes.
  • MetaAnchor is able to work with most anchor-based object detection systems, such as RetinaNet. Compared with the predefined anchor scheme, the authors empirically find that MetaAnchor is more robust to anchor settings and bounding box distributions; in addition, it shows potential on transfer tasks.
  • The authors' experiment on the COCO detection task shows that MetaAnchor consistently outperforms its counterparts in various scenarios
Tables
  • Table1: Anchor box configurations
  • Table2: Comparison of RetinaNets with/without MetaAnchor
  • Table3: Comparison of various anchors in inference (mmAP, %)
  • Table4: Comparison in the scenarios of different training/test distributions (mmAP, %); columns: # of Anchors, Baseline (all), MetaAnchor (all), Baseline (drop), MetaAnchor (drop)
  • Table5: Transfer evaluation on VOC 2007 test set from COCO-full dataset
  • Table6: Comparison of anchor function generators (mmAP, %)
  • Table7: Results of YOLOv2 on COCO minival (%); columns: Method, Baseline, MetaAnchor, Search
  • Table8: Results on COCO minival
Related work
  • Anchor methodology in object detection. Anchors (sometimes under other names, e.g. “default boxes” in [25], “priors” in [39] or “grid cells” in [30]) are employed in most state-of-the-art detection systems [39, 32, 22, 23, 25, 7, 11, 2, 31, 21, 35, 15]. The essentials of an anchor include its position, size, class label and so on. Currently most detectors model anchors via enumeration, i.e. predefining a number of anchor boxes with all kinds of positions, sizes and class labels, which leads to the following issues. First, anchor boxes need careful design, e.g. via clustering [31], which is especially critical in specific detection tasks such as anchor-based face [40, 45, 28, 36, 43] and pedestrian [41, 3, 44, 26] detection. In particular, some papers suggest multi-scale anchors [25, 22, 23] to handle objects of different sizes. Second, predefined anchor functions may introduce too many parameters. Much work addresses this issue through weight sharing: in contrast to earlier work like [5, 30], detectors like [32, 25, 31] and their follow-ups [7, 22, 2, 11, 23] employ translation-invariant anchors produced by fully-convolutional networks, which share parameters across different positions; two-stage frameworks such as [32, 2] share weights across various classes; and [23] shares weights among multiple detection heads. In comparison, our approach is free of these issues, as anchor functions are customized and generated dynamically.
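The enumeration scheme described above can be made concrete with a generic sketch in the spirit of [32, 25]. The base size, scales, and aspect ratios below are invented defaults, not values from any particular detector: each (scale, ratio) pair yields one predefined (w, h) anchor box, and this whole grid must be fixed before training.

```python
import math

def enumerate_anchors(base_size=32, scales=(1, 2, 4), aspect_ratios=(0.5, 1.0, 2.0)):
    """Predefined anchor enumeration (the scheme MetaAnchor contrasts with):
    one (w, h) box per (scale, aspect ratio) pair, where ratio = h / w.
    Each box keeps the area (base_size * scale)^2 while varying its shape."""
    anchors = []
    for s in scales:
        for r in aspect_ratios:
            area = (base_size * s) ** 2
            w = math.sqrt(area / r)
            h = w * r
            anchors.append((w, h))
    return anchors

anchors = enumerate_anchors()
print(len(anchors))  # → 9 (3 scales x 3 ratios)
```

Any object shape outside this 9-box grid must be approximated by the nearest predefined anchor, which is exactly the rigidity MetaAnchor removes.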
Funding
  • This work is supported by National Key R&D Program No. 2017YFA0700800, China.
Reference
  • M. Andrychowicz, M. Denil, S. Gomez, M. W. Hoffman, D. Pfau, T. Schaul, and N. de Freitas. Learning to learn by gradient descent by gradient descent. In Advances in Neural Information Processing Systems, pages 3981–3989, 2016.
  • J. Dai, Y. Li, K. He, and J. Sun. R-FCN: Object detection via region-based fully convolutional networks. In Advances in Neural Information Processing Systems, pages 379–387, 2016.
  • P. Dollár, C. Wojek, B. Schiele, and P. Perona. Pedestrian detection: An evaluation of the state of the art. IEEE Transactions on Pattern Analysis and Machine Intelligence, 34(4):743–761, 2012.
  • M. Elhoseiny, B. Saleh, and A. Elgammal. Write a classifier: Zero-shot learning using purely textual descriptions. In IEEE International Conference on Computer Vision (ICCV), pages 2584–2591, 2013.
  • D. Erhan, C. Szegedy, A. Toshev, and D. Anguelov. Scalable object detection using deep neural networks. In IEEE Conference on Computer Vision and Pattern Recognition, pages 2147–2154, 2014.
  • M. Everingham, L. Van Gool, C. K. I. Williams, J. Winn, and A. Zisserman. The PASCAL visual object classes (VOC) challenge. International Journal of Computer Vision, 88(2):303–338, June 2010.
  • C.-Y. Fu, W. Liu, A. Ranga, A. Tyagi, and A. C. Berg. DSSD: Deconvolutional single shot detector. arXiv preprint arXiv:1701.06659, 2017.
  • R. Girshick. Fast R-CNN. arXiv preprint arXiv:1504.08083, 2015.
  • R. Girshick, J. Donahue, T. Darrell, and J. Malik. Rich feature hierarchies for accurate object detection and semantic segmentation. In IEEE Conference on Computer Vision and Pattern Recognition, pages 580–587, 2014.
  • D. Ha, A. Dai, and Q. V. Le. HyperNetworks. arXiv preprint arXiv:1609.09106, 2016.
  • K. He, G. Gkioxari, P. Dollár, and R. Girshick. Mask R-CNN. In IEEE International Conference on Computer Vision (ICCV), pages 2980–2988, 2017.
  • K. He, X. Zhang, S. Ren, and J. Sun. Spatial pyramid pooling in deep convolutional networks for visual recognition. In European Conference on Computer Vision, pages 346–361.
  • K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In IEEE Conference on Computer Vision and Pattern Recognition, pages 770–778, 2016.
  • J. Hoffman, S. Guadarrama, E. S. Tzeng, R. Hu, J. Donahue, R. Girshick, T. Darrell, and K. Saenko. LSDA: Large scale detection through adaptation. In Advances in Neural Information Processing Systems, pages 3536–3544, 2014.
  • R. Hu, P. Dollár, K. He, T. Darrell, and R. Girshick. Learning to segment every thing. arXiv preprint arXiv:1711.10370, 2017.
  • L. Huang, Y. Yang, Y. Deng, and Y. Yu. DenseBox: Unifying landmark localization with end to end object detection. arXiv preprint arXiv:1509.04874, 2015.
  • S. Ioffe and C. Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. arXiv preprint arXiv:1502.03167, 2015.
  • M. Jaderberg, K. Simonyan, A. Zisserman, et al. Spatial transformer networks. In Advances in Neural Information Processing Systems, pages 2017–2025, 2015.
  • Z. Li, C. Peng, G. Yu, X. Zhang, Y. Deng, and J. Sun. Light-Head R-CNN: In defense of two-stage object detector. arXiv preprint arXiv:1711.07264, 2017.
  • Z. Li, C. Peng, G. Yu, X. Zhang, Y. Deng, and J. Sun. DetNet: A backbone network for object detection. arXiv preprint arXiv:1804.06215, 2018.
  • Z. Li and F. Zhou. FSSD: Feature fusion single shot multibox detector. arXiv preprint arXiv:1712.00960, 2017.
  • T.-Y. Lin, P. Dollár, R. Girshick, K. He, B. Hariharan, and S. Belongie. Feature pyramid networks for object detection. In CVPR, 2017.
  • T.-Y. Lin, P. Goyal, R. Girshick, K. He, and P. Dollár. Focal loss for dense object detection. arXiv preprint arXiv:1708.02002, 2017.
  • T.-Y. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár, and C. L. Zitnick. Microsoft COCO: Common objects in context. In European Conference on Computer Vision, pages 740–755.
  • W. Liu, D. Anguelov, D. Erhan, C. Szegedy, S. Reed, C.-Y. Fu, and A. C. Berg. SSD: Single shot multibox detector. In European Conference on Computer Vision, pages 21–37.
  • J. Mao, T. Xiao, Y. Jiang, and Z. Cao. What can help pedestrian detection? In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017.
  • I. Misra, A. Gupta, and M. Hebert. From red wine to red tomato: Composition with context. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017.
  • M. Najibi, P. Samangouei, R. Chellappa, and L. Davis. SSH: Single stage headless face detector. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 4875–4884, 2017.
  • S. J. Pan and Q. Yang. A survey on transfer learning. IEEE Transactions on Knowledge and Data Engineering, 22(10):1345–1359, 2010.
  • J. Redmon, S. Divvala, R. Girshick, and A. Farhadi. You only look once: Unified, real-time object detection. In IEEE Conference on Computer Vision and Pattern Recognition, pages 779–788, 2016.
  • J. Redmon and A. Farhadi. YOLO9000: Better, faster, stronger. arXiv preprint, 2017.
  • S. Ren, K. He, R. Girshick, and J. Sun. Faster R-CNN: Towards real-time object detection with region proposal networks. In Advances in Neural Information Processing Systems, pages 91–99, 2015.
  • S. H. Rezatofighi, R. Kaskman, F. T. Motlagh, Q. Shi, D. Cremers, L. Leal-Taixé, and I. Reid. Deep perm-set net: Learn to predict sets with unknown permutation and cardinality using deep neural networks. arXiv preprint arXiv:1805.00613, 2018.
  • O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein, et al. ImageNet large scale visual recognition challenge. International Journal of Computer Vision, 115(3):211–252, 2015.
  • Z. Shen, Z. Liu, J. Li, Y.-G. Jiang, Y. Chen, and X. Xue. DSOD: Learning deeply supervised object detectors from scratch. In IEEE International Conference on Computer Vision (ICCV), 2017.
  • G. Song, Y. Liu, M. Jiang, Y. Wang, J. Yan, and B. Leng. Beyond trade-off: Accelerate FCN-based face detector with higher accuracy. arXiv preprint arXiv:1804.05197, 2018.
  • R. Stewart, M. Andriluka, and A. Y. Ng. End-to-end people detection in crowded scenes. In IEEE Conference on Computer Vision and Pattern Recognition, pages 2325–2333, 2016.
  • C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, A. Rabinovich, et al. Going deeper with convolutions. In CVPR, 2015.
  • C. Szegedy, S. Reed, D. Erhan, D. Anguelov, and S. Ioffe. Scalable, high-quality object detection. arXiv preprint arXiv:1412.1441, 2014.
  • J. Wang, Y. Yuan, G. Yu, and S. Jian. SFace: An efficient network for face detection in large scale variations. arXiv preprint arXiv:1804.06559, 2018.
  • X. Wang, T. Xiao, Y. Jiang, S. Shao, J. Sun, and C. Shen. Repulsion loss: Detecting pedestrians in a crowd. arXiv preprint arXiv:1711.07752, 2017.
  • Y.-X. Wang and M. Hebert. Learning to learn: Model regression networks for easy small sample learning. In European Conference on Computer Vision, pages 616–634.
  • K. Zhang, Z. Zhang, Z. Li, and Y. Qiao. Joint face detection and alignment using multitask cascaded convolutional networks. IEEE Signal Processing Letters, 23(10):1499–1503, 2016.
  • L. Zhang, L. Lin, X. Liang, and K. He. Is Faster R-CNN doing well for pedestrian detection? In European Conference on Computer Vision, pages 443–457.
  • S. Zhang, X. Zhu, Z. Lei, H. Shi, X. Wang, and S. Z. Li. S3FD: Single shot scale-invariant face detector. arXiv preprint arXiv:1708.05237, 2017.