PolarMask: Single Shot Instance Segmentation with Polar Representation

    CVPR 2020, 2019.

    Keywords: time instance segmentation, stochastic gradient descent, binary cross entropy, dense distance regression, object detection

    Abstract:

    In this paper, we introduce an anchor-box-free, single-shot instance segmentation method that is conceptually simple, fully convolutional, and can be used as a mask prediction module for instance segmentation by embedding it into most off-the-shelf detection methods. Our method, termed PolarMask, formulates the instance segm…
    Introduction
    • Instance segmentation is one of the fundamental tasks in computer vision, which enables numerous downstream vision applications.
    • It is challenging because it requires predicting both the location and the semantic mask of each instance in an image.
    • The authors' aim is to design a conceptually simple mask prediction module that can be plugged into many off-the-shelf detectors, enabling instance segmentation.
    • Instance segmentation is usually solved as binary classification within a spatial layout bounded by a bounding box, as shown in Figure 1(b).
    • Such pixel-to-pixel correspondence
    Highlights
    • Instance segmentation is one of the fundamental tasks in computer vision, which enables numerous downstream vision applications
    • In order to maximize the advantages of Polar Representation, we propose Polar Centerness and Polar IoU Loss to deal with sampling high-quality center examples and optimization for dense distance regression, respectively
    • We introduce a brand-new framework for instance segmentation, termed PolarMask, which models instance masks in polar coordinates and thereby converts instance segmentation into two parallel tasks: instance center classification and dense distance regression
    • For the first time, we show that the complexity of instance segmentation, in terms of both design and computation complexity, can be the same as bounding box object detection
    • If a ray emitted from a center that lies outside the mask has no intersection with the instance contour at a given angle, we set its regression target to the minimum value ε (e.g., ε = 10⁻⁶). We argue that these corner cases are the main obstacle that keeps the upper bound of Polar Representation from reaching 100% AP
    • Unlike previous works that typically solve mask prediction as binary classification in a spatial layout, PolarMask represents a mask by its contour and models the contour by one center and rays emitted from the center to the contour in polar coordinates
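    The polar encoding described in the highlights can be sketched as follows. This is a minimal illustration, not the authors' code. It assumes a binary mask stored as a nested list indexed as `mask[y][x]`, a given center, and 36 rays spaced 10° apart (assumed here as the default). The hypothetical helper `polar_targets` approximates each ray length by marching along the ray in fixed steps and keeping the farthest sample that falls inside the mask. Keeping the farthest hit is this sketch's choice for rays that cross the contour more than once. Rays that never meet the mask fall back to ε, as in the corner case above.

```python
import math

EPS = 1e-6  # minimum regression target for rays that never hit the mask


def polar_targets(mask, center, num_rays=36, step=0.5):
    """Approximate ray lengths from `center` to the instance contour.

    mask:   2D list of 0/1 values indexed as mask[y][x]; pixel (x, y)
            covers the square [x, x + 1) x [y, y + 1).
    center: (cx, cy) in pixel coordinates; it may lie outside the mask.
    Each ray is marched in increments of `step`, and the farthest sample
    inside the mask is kept, so lengths are accurate to about one step.
    """
    h, w = len(mask), len(mask[0])
    cx, cy = center
    max_len = math.hypot(h, w)  # long enough to cross the image from an in-image center
    targets = []
    for k in range(num_rays):
        theta = 2 * math.pi * k / num_rays  # rays evenly spaced in angle
        dx, dy = math.cos(theta), math.sin(theta)
        best = EPS  # fallback when the ray never meets the mask
        t = 0.0
        while t <= max_len:
            xi = math.floor(cx + t * dx)
            yi = math.floor(cy + t * dy)
            if 0 <= xi < w and 0 <= yi < h and mask[yi][xi]:
                best = max(best, t)  # keep the farthest hit along the ray
            t += step
        targets.append(best)
    return targets


# Usage: a 6x6 square instance inside a 10x10 image, centered at (5, 5)
mask = [[1 if 2 <= x <= 7 and 2 <= y <= 7 else 0 for x in range(10)] for y in range(10)]
print(polar_targets(mask, (5.0, 5.0), num_rays=4))
```

    Ray marching trades accuracy for simplicity. A contour-based computation would give exact intersections, but this version makes both corner cases easy to see.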
    Methods
    • The authors first briefly introduce the overall architecture of the proposed PolarMask.
    • The authors reformulate instance segmentation with the proposed Polar Representation.
    • PolarMask is a simple, unified network composed of a backbone network [16], a feature pyramid network [22], and two or three task-specific heads, depending on whether bounding boxes are also predicted.
    • The settings of the backbone and feature pyramid network are the same as FCOS [29].
    • While there exist many stronger candidates for those components, the authors align these settings with FCOS to show the simplicity and effectiveness of the instance modeling method.
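    Because the backbone and FPN follow FCOS, the per-location outputs of the heads can be tallied as a quick sanity check. The sketch below only computes tensor shapes and does not build a network. Its assumptions are FCOS-style pyramid levels P3 to P7 with strides 8 to 128, 80 COCO classes, 36 rays, and feature maps of ceil(image size / stride) per side (the exact size depends on padding). The helpers `head_output_shapes` and `count_locations` are hypothetical. The mask head regresses one distance per ray at each location, whereas a box head regresses 4.

```python
import math

# Assumed FCOS-style pyramid levels P3-P7 and their strides.
STRIDES = (8, 16, 32, 64, 128)


def head_output_shapes(img_h, img_w, num_classes=80, num_rays=36, with_box=False):
    """Return per-level (channels, height, width) for each head output.

    A stride-s level has ceil(img / s) locations per side, and every
    location predicts class scores, one polar centerness value, and one
    distance per ray; an optional box head adds 4 distances.
    """
    shapes = {}
    for s in STRIDES:
        h, w = math.ceil(img_h / s), math.ceil(img_w / s)
        level = {
            "cls": (num_classes, h, w),
            "centerness": (1, h, w),
            "mask_rays": (num_rays, h, w),
        }
        if with_box:
            level["box"] = (4, h, w)
        shapes[s] = level
    return shapes


def count_locations(shapes):
    """Total number of candidate centers across all pyramid levels."""
    return sum(level["cls"][1] * level["cls"][2] for level in shapes.values())


# Usage with an example 800x1024 input
shapes = head_output_shapes(800, 1024)
print(shapes[8])
print(count_locations(shapes))
```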
    Results
    • With the Polar IoU Loss, PolarMask achieves 27.7% AP without any weighting to balance the regression and classification losses.
    • Deformable convolution (DCN) [8] improves AP by at least 2.3% across the different backbones.
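    The Polar IoU Loss compares predicted and target ray lengths directly. As we understand the paper's formulation, the polar IoU is approximated by Σ min(d, d*) / Σ max(d, d*) over the n rays, and the loss is log(Σ max / Σ min). The sketch below assumes that form and plain Python lists of equal length. The function names are illustrative.

```python
import math


def polar_iou(pred, target):
    """Approximate IoU of two polar masks that share a center and ray angles."""
    if len(pred) != len(target):
        raise ValueError("pred and target must have the same number of rays")
    d_min = sum(min(p, t) for p, t in zip(pred, target))
    d_max = sum(max(p, t) for p, t in zip(pred, target))
    return d_min / d_max


def polar_iou_loss(pred, target):
    """log(sum(d_max) / sum(d_min)); zero when every ray matches exactly."""
    return -math.log(polar_iou(pred, target))


# Usage: one ray too short, one exact
print(polar_iou([1.0, 2.0], [2.0, 2.0]), polar_iou_loss([1.0, 2.0], [2.0, 2.0]))
```

    The ratio is unchanged when every distance is scaled by the same factor. As a result, large and small instances contribute losses on a comparable scale, which is consistent with the result above that no loss-balancing weight was needed.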
    Conclusion
    • PolarMask is a single-shot, anchor-box-free instance segmentation method.
    • Unlike previous works that typically solve mask prediction as binary classification in a spatial layout, PolarMask represents a mask by its contour and models the contour by one center and rays emitted from the center to the contour in polar coordinates.
    • PolarMask is designed to be almost as simple and clean as single-shot object detectors, introducing negligible computational overhead.
    • The authors hope that the proposed PolarMask framework can serve as a fundamental and strong baseline for single-shot instance segmentation tasks.
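    At inference, a center and its predicted ray lengths describe a polygon whose k-th vertex lies at (cx + d_k cos θ_k, cy + d_k sin θ_k). The sketch below decodes rays into vertices and measures the polygon with the shoelace formula. The helper names are hypothetical. Rasterizing the polygon into a pixel mask (for example with OpenCV's `cv2.fillPoly`) is left out.

```python
import math


def rays_to_polygon(center, distances):
    """Turn n ray lengths at evenly spaced angles into polygon vertices."""
    cx, cy = center
    n = len(distances)
    return [
        (cx + d * math.cos(2 * math.pi * k / n), cy + d * math.sin(2 * math.pi * k / n))
        for k, d in enumerate(distances)
    ]


def polygon_area(vertices):
    """Shoelace formula; the absolute value ignores vertex orientation."""
    s = 0.0
    for (x0, y0), (x1, y1) in zip(vertices, vertices[1:] + vertices[:1]):
        s += x0 * y1 - x1 * y0
    return abs(s) / 2


# Usage: 36 unit-length rays give a 36-gon whose area approaches pi
poly = rays_to_polygon((0.0, 0.0), [1.0] * 36)
print(len(poly), polygon_area(poly))
```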
    Tables
    • Table1: Ablation experiments for PolarMask. All models are trained on trainval35k and tested on minival, using ResNet50-FPN backbone unless otherwise noted
    • Table2: Instance segmentation mask AP on COCO test-dev. The standard training strategy [14] trains for 12 epochs; ‘aug’ means data augmentation, including multi-scale training and random cropping. is trained with ‘aug’, ◦ is trained without ‘aug’
    • Table3: Computation complexity and parameters comparison with other methods. Note that “PolarMask w/o box” only introduces marginal computation when compared with FCOS
    • Table4: Speed analysis of different methods. All post-processing is included. The input images are resized so that their shorter side is …
    • Table5: Benchmark results of PolarMask on the MS-COCO [24] validation set (minival). All models here are trained with FPN [22]
    • Table6: Comparison of PolarMask and ESE-Seg on COCO 2017 val. is equipped with ‘Polar IoU Loss’ and ‘Polar Centerness’, ◦ is not
    Related work
    • Two-Stage Instance Segmentation. Two-stage instance segmentation methods often formulate the task in a “Detect then Segment” paradigm [21, 15, 25, 18]: they first detect bounding boxes and then perform segmentation within each box. The main idea of FCIS [21] is to predict a set of position-sensitive output channels fully convolutionally; these channels simultaneously address object classes, boxes, and masks, which makes the system fast. Mask R-CNN [15], built upon Faster R-CNN, simply adds a mask branch and uses RoIAlign instead of RoI Pooling [12] for improved accuracy. Following Mask R-CNN, PANet [25] introduces bottom-up path augmentation, adaptive feature pooling, and fully-connected fusion to boost instance segmentation performance. Mask Scoring R-CNN [18] re-scores the mask confidence, which otherwise comes from the classification score, by adding a mask-IoU branch that makes the network predict the IoU between the predicted mask and the ground truth.
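    The RoIAlign-versus-RoI-Pooling distinction comes down to how a feature is read at a real-valued location. The sketch contrasts two approaches. One rounds the coordinate down, a simplified stand-in for the quantization RoI Pooling performs. The other uses bilinear interpolation, which is what RoIAlign relies on. A full RoIAlign also samples several points per output bin and pools them, which is omitted here. The function names are illustrative.

```python
import math


def quantized_sample(feature, x, y):
    """Read the cell containing (x, y), discarding the fractional part."""
    return feature[math.floor(y)][math.floor(x)]


def bilinear_sample(feature, x, y):
    """Interpolate feature[y][x] at a real-valued location (RoIAlign-style).

    Assumes 0 <= x <= width - 1 and 0 <= y <= height - 1.
    """
    h, w = len(feature), len(feature[0])
    x0, y0 = math.floor(x), math.floor(y)
    x1, y1 = min(x0 + 1, w - 1), min(y0 + 1, h - 1)  # clamp at the border
    ax, ay = x - x0, y - y0  # fractional offsets act as interpolation weights
    top = (1 - ax) * feature[y0][x0] + ax * feature[y0][x1]
    bottom = (1 - ax) * feature[y1][x0] + ax * feature[y1][x1]
    return (1 - ay) * top + ay * bottom


# Usage: sampling halfway between four cells
feature = [[0.0, 1.0], [2.0, 3.0]]
print(quantized_sample(feature, 0.5, 0.5), bilinear_sample(feature, 0.5, 0.5))
```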
    Contributions
    • Introduces an anchor-box free and single shot instance segmentation method, which is conceptually simple, fully convolutional and can be used by embedding it into most off-the-shelf detection methods
    • Shows that the complexity of instance segmentation, in terms of both design and computation, can be the same as that of bounding box object detection, and that this much simpler and more flexible instance segmentation framework can achieve competitive accuracy
    • Proposes Polar Centerness and Polar IoU Loss to deal with sampling high-quality center examples and optimization for dense distance regression, respectively
    • Introduces a brand-new framework for instance segmentation, termed PolarMask, which models instance masks in polar coordinates and converts instance segmentation into two parallel tasks: instance center classification and dense distance regression
    • Proposes the Polar IoU Loss and Polar Centerness, tailored to the PolarMask framework
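    Polar Centerness scores how central a location is using only its ray lengths. The sketch below assumes the paper's definition sqrt(min(d) / max(d)) over the n rays. The `rescore` helper is hypothetical: it multiplies the centerness into the classification score at test time, mirroring how FCOS uses its centerness, and is shown here as an assumed usage.

```python
import math


def polar_centerness(distances):
    """sqrt(min / max) of the ray lengths: 1.0 when all rays are equal."""
    return math.sqrt(min(distances) / max(distances))


def rescore(cls_score, distances):
    """Down-weight classification scores of off-center locations (assumed usage)."""
    return cls_score * polar_centerness(distances)


# Usage: a perfectly central point vs. one four times closer to one side
print(polar_centerness([2.0] * 36), polar_centerness([1.0, 4.0]), rescore(0.8, [1.0, 4.0]))
```

    Locations whose rays include the ε fallback get a centerness near zero. This suppresses centers that sit near or outside the instance boundary.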
    Reference
    • [1] Min Bai and Raquel Urtasun. Deep watershed transform for instance segmentation. In Proc. IEEE Conf. Comp. Vis. Patt. Recogn., pages 5221–5229, 2017.
    • [2] Daniel Bolya, Chong Zhou, Fanyi Xiao, and Yong Jae Lee. YOLACT: Real-time instance segmentation. In Proc. IEEE Int. Conf. Comp. Vis., 2019.
    • [3] Kai Chen, Jiaqi Wang, Jiangmiao Pang, Yuhang Cao, Yu Xiong, Xiaoxiao Li, Shuyang Sun, Wansen Feng, Ziwei Liu, Jiarui Xu, Zheng Zhang, Dazhi Cheng, Chenchen Zhu, Tianheng Cheng, Qijie Zhao, Buyu Li, Xin Lu, Rui Zhu, Yue Wu, Jifeng Dai, Jingdong Wang, Jianping Shi, Wanli Ouyang, Chen Change Loy, and Dahua Lin. MMDetection: Open MMLab detection toolbox and benchmark, 2019.
    • [4] Xinlei Chen, Ross Girshick, Kaiming He, and Piotr Dollar. TensorMask: A foundation for dense object segmentation. In Proc. IEEE Int. Conf. Comp. Vis., 2019.
    • [5] Marius Cordts, Mohamed Omran, Sebastian Ramos, Timo Rehfeld, Markus Enzweiler, Rodrigo Benenson, Uwe Franke, Stefan Roth, and Bernt Schiele. The Cityscapes dataset for semantic urban scene understanding. In Proc. IEEE Conf. Comp. Vis. Patt. Recogn., pages 3213–3223, 2016.
    • [6] Jifeng Dai, Kaiming He, Yi Li, Shaoqing Ren, and Jian Sun. Instance-sensitive fully convolutional networks. In Proc. Eur. Conf. Comp. Vis., pages 534–549.
    • [7] Jifeng Dai, Kaiming He, and Jian Sun. Instance-aware semantic segmentation via multi-task network cascades. In Proc. IEEE Conf. Comp. Vis. Patt. Recogn., pages 3150–3158, 2016.
    • [8] Jifeng Dai, Haozhi Qi, Yuwen Xiong, Yi Li, Guodong Zhang, Han Hu, and Yichen Wei. Deformable convolutional networks. In Proc. IEEE Int. Conf. Comp. Vis., pages 764–773, 2017.
    • [9] Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. ImageNet: A large-scale hierarchical image database. In Proc. IEEE Conf. Comp. Vis. Patt. Recogn., pages 248–255.
    • [10] Kaiwen Duan, Song Bai, Lingxi Xie, Honggang Qi, Qingming Huang, and Qi Tian. CenterNet: Keypoint triplets for object detection. In Proc. IEEE Int. Conf. Comp. Vis., pages 6569–6578, 2019.
    • [11] Mark Everingham, Luc Van Gool, Christopher KI Williams, John Winn, and Andrew Zisserman. The PASCAL Visual Object Classes (VOC) challenge. International Journal of Computer Vision, 2010.
    • [12] Ross Girshick. Fast R-CNN. In Proc. IEEE Int. Conf. Comp. Vis., pages 1440–1448, 2015.
    • [13] Ross Girshick, Jeff Donahue, Trevor Darrell, and Jitendra Malik. Rich feature hierarchies for accurate object detection and semantic segmentation. In Proc. IEEE Conf. Comp. Vis. Patt. Recogn., pages 580–587, 2014.
    • [14] Ross Girshick, Ilija Radosavovic, Georgia Gkioxari, Piotr Dollar, and Kaiming He. Detectron. https://github.com/facebookresearch/detectron, 2018.
    • [15] Kaiming He, Georgia Gkioxari, Piotr Dollar, and Ross Girshick. Mask R-CNN. In Proc. IEEE Int. Conf. Comp. Vis., pages 2961–2969, 2017.
    • [16] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proc. IEEE Conf. Comp. Vis. Patt. Recogn., June 2016.
    • [17] Lichao Huang, Yi Yang, Yafeng Deng, and Yinan Yu. DenseBox: Unifying landmark localization with end to end object detection. arXiv preprint arXiv:1509.04874, 2015.
    • [18] Zhaojin Huang, Lichao Huang, Yongchao Gong, Chang Huang, and Xinggang Wang. Mask Scoring R-CNN. In Proc. IEEE Conf. Comp. Vis. Patt. Recogn., pages 6409–6418, 2019.
    • [19] Tao Kong, Fuchun Sun, Huaping Liu, Yuning Jiang, and Jianbo Shi. FoveaBox: Beyond anchor-based object detector. arXiv preprint arXiv:1904.03797, 2019.
    • [20] Alina Kuznetsova, Hassan Rom, Neil Alldrin, Jasper Uijlings, Ivan Krasin, Jordi Pont-Tuset, Shahab Kamali, Stefan Popov, Matteo Malloci, Tom Duerig, et al. The Open Images Dataset V4: Unified image classification, object detection, and visual relationship detection at scale. arXiv preprint arXiv:1811.00982, 2018.
    • [21] Yi Li, Haozhi Qi, Jifeng Dai, Xiangyang Ji, and Yichen Wei. Fully convolutional instance-aware semantic segmentation. In Proc. IEEE Conf. Comp. Vis. Patt. Recogn., pages 2359–2367, 2017.
    • [22] Tsung-Yi Lin, Piotr Dollar, Ross Girshick, Kaiming He, Bharath Hariharan, and Serge Belongie. Feature pyramid networks for object detection. In Proc. IEEE Conf. Comp. Vis. Patt. Recogn., July 2017.
    • [23] Tsung-Yi Lin, Priya Goyal, Ross Girshick, Kaiming He, and Piotr Dollar. Focal loss for dense object detection. In Proc. IEEE Int. Conf. Comp. Vis., Oct 2017.
    • [24] Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollar, and C Lawrence Zitnick. Microsoft COCO: Common objects in context. In Proc. Eur. Conf. Comp. Vis., pages 740–755.
    • [25] Shu Liu, Lu Qi, Haifang Qin, Jianping Shi, and Jiaya Jia. Path aggregation network for instance segmentation. In Proc. IEEE Conf. Comp. Vis. Patt. Recogn., pages 8759–8768, 2018.
    • [26] Alejandro Newell, Kaiyu Yang, and Jia Deng. Stacked hourglass networks for human pose estimation. In Proc. Eur. Conf. Comp. Vis., pages 483–499.
    • [27] Joseph Redmon, Santosh Divvala, Ross Girshick, and Ali Farhadi. You only look once: Unified, real-time object detection. In Proc. IEEE Conf. Comp. Vis. Patt. Recogn., pages 779–788, 2016.
    • [28] Uwe Schmidt, Martin Weigert, Coleman Broaddus, and Gene Myers. Cell detection with star-convex polygons. In Proc. Int. Medical Image Computing and Computer-Assisted Intervention, pages 265–273.
    • [29] Zhi Tian, Chunhua Shen, Hao Chen, and Tong He. FCOS: Fully convolutional one-stage object detection. In Proc. IEEE Int. Conf. Comp. Vis., 2019.
    • [30] Saining Xie, Ross Girshick, Piotr Dollar, Zhuowen Tu, and Kaiming He. Aggregated residual transformations for deep neural networks. In Proc. IEEE Conf. Comp. Vis. Patt. Recogn., pages 1492–1500, 2017.
    • [31] Wenqiang Xu, Haiyang Wang, Fubo Qi, and Cewu Lu. Explicit shape encoding for real-time instance segmentation. In Proc. IEEE Int. Conf. Comp. Vis., pages 5168–5177, 2019.
    • [32] Ze Yang, Shaohui Liu, Han Hu, Liwei Wang, and Stephen Lin. RepPoints: Point set representation for object detection. arXiv: Comp. Res. Repository, 2019.
    • [33] Jiahui Yu, Yuning Jiang, Zhangyang Wang, Zhimin Cao, and Thomas Huang. UnitBox: An advanced object detection network. In Proc. ACM Int. Conf. Multimedia, pages 516–520. ACM, 2016.
    • [34] Xingyi Zhou, Dequan Wang, and Philipp Krahenbuhl. Objects as points. arXiv: Comp. Res. Repository, 2019.
    • [35] Xingyi Zhou, Jiacheng Zhuo, and Philipp Krahenbuhl. Bottom-up object detection by grouping extreme and center points. In Proc. IEEE Conf. Comp. Vis. Patt. Recogn., pages 850–859, 2019.