An End-to-End Network for Panoptic Segmentation

    CVPR, pp. 6172-6181, 2019.

    Keywords: object instance, semantic image segmentation, instance segmentation, deep convolutional network, semantic segmentation

    Abstract:

    Panoptic segmentation, which needs to assign a category label to each pixel and segment each object instance simultaneously, is a challenging topic. Traditionally, existing approaches utilize two independent models without sharing features, which makes the pipeline inefficient to implement. In addition, a heuristic method is usually e...

    Introduction
    • The goal is to assign a category label to each pixel and to segment each object instance in the image.
    • In this task, stuff segmentation is employed to predict the amorphous regions, while instance segmentation [14] handles the countable objects.
    • In existing approaches, the instance and stuff segmentation blocks are independent
    Highlights
    • Panoptic segmentation [18] is a new and challenging topic for scene understanding
    • We introduce a novel spatial ranking module to address the ambiguities of overlapping relationships, which commonly exist in panoptic segmentation
    • The mathematical formulations of PQ, Segmentation Quality, and Detection Quality are presented in Equation 5, where p and g are predictions and ground truth, and TP, FP, FN represent true positives, false positives, and false negatives
    • Segmentation Quality is the common mean IoU metric normalized over matched instances, and Detection Quality can be regarded as a form of detection accuracy
    • We propose a novel end-to-end occlusion-aware algorithm, which incorporates the common semantic segmentation and instance segmentation into a single model
    • We observe the particular ranking problem that arises in panoptic segmentation, and design a simple but effective spatial ranking module to deal with this issue
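The decomposition referenced above can be written out explicitly; this is the standard formulation from [18] (where Detection Quality is the quantity called Recognition Quality, RQ), reconstructed here rather than copied from the paper's Equation 5:

```latex
\mathrm{PQ}
= \underbrace{\frac{\sum_{(p,g)\in TP}\mathrm{IoU}(p,g)}{|TP|}}_{\text{Segmentation Quality (SQ)}}
\times
\underbrace{\frac{|TP|}{|TP| + \tfrac{1}{2}|FP| + \tfrac{1}{2}|FN|}}_{\text{Detection Quality (DQ)}}
```

SQ is the mean IoU over matched segments only, and DQ is an F1-style detection score, which matches the interpretation given in the highlights.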
    Methods
    • Ablation on the spatial ranking module (ResNet-50 backbone; columns are PQ / PQTh / PQSt, SQ / SQTh / SQSt, DQ / DQTh / DQSt):
      baseline:                  37.2 45.4 24.9 | 77.1 81.5 70.6 | 45.7 54.4 32.5
      w/ pano-instance GT:       36.1 43.5 24.9 | 76.1 80.0 70.3 | 44.5 52.4 32.7
      w/ spatial ranking module: 39.0 48.3 24.9 | 77.1 81.4 70.6 | 47.8 58.0 32.5
    • PQTh and PQSt denote PQ over thing and stuff classes; the DQ columns reflect the accuracy and recall for objects
    • This phenomenon may come from the fact that most objects in COCO do not exhibit the overlapping issue, so forcing the network to learn non-overlapping predictions hurts the overall performance.
    • Merely replacing the instance ground truth does not help improve the performance, and may even reduce it
    Results
    • The authors use the standard evaluation metric defined in [18], called Panoptic Quality (PQ).
    • It contains two factors: 1) the Segmentation Quality (SQ) measures the quality of all categories and 2) the Detection Quality (DQ) measures only the instance classes.
    • The matching threshold is set to 0.5: if the pixel IoU between a prediction and a ground-truth segment is larger than 0.5, the prediction is regarded as matched; otherwise it is unmatched.
    • Each stuff class in an image is regarded as one instance, regardless of its shape
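The matching and scoring rules above can be sketched in a few lines. This is an illustrative re-implementation of the standard PQ definition from [18], not the paper's evaluation code; the function names and the greedy matching loop are my own (at IoU > 0.5 matches are unique anyway, so greedy matching suffices):

```python
import numpy as np

def iou(mask_a, mask_b):
    """Pixel IoU between two boolean masks."""
    inter = np.logical_and(mask_a, mask_b).sum()
    union = np.logical_or(mask_a, mask_b).sum()
    return inter / union if union else 0.0

def panoptic_quality(preds, gts, thresh=0.5):
    """PQ over one category: match segments at IoU > thresh, then
    PQ = sum(matched IoU) / (|TP| + 0.5*|FP| + 0.5*|FN|)."""
    matched_gt = set()
    tp_ious = []
    fp = 0
    for p in preds:
        best_iou, best_g = 0.0, None
        for gi, g in enumerate(gts):
            if gi in matched_gt:
                continue
            v = iou(p, g)
            if v > best_iou:
                best_iou, best_g = v, gi
        if best_iou > thresh:          # matched: contributes to SQ and TP
            matched_gt.add(best_g)
            tp_ious.append(best_iou)
        else:                          # unmatched prediction: false positive
            fp += 1
    fn = len(gts) - len(matched_gt)    # unmatched ground truth: false negative
    tp = len(tp_ious)
    denom = tp + 0.5 * fp + 0.5 * fn
    return sum(tp_ious) / denom if denom else 0.0
```

Treating each stuff region as a single instance, as the bullet above notes, lets the same routine score stuff and thing classes uniformly.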
    Conclusion
    • The authors propose a novel end-to-end occlusion-aware algorithm, which incorporates the common semantic segmentation and instance segmentation into a single model.
    • To better employ the different supervisions and reduce the consumption of computational resources, the authors investigate feature sharing between the branches and find that as many features as possible should be shared.
    • The authors observe the particular ranking problem that arises in panoptic segmentation, and design a simple but effective spatial ranking module to deal with it.
    • The experimental results show that the approach outperforms the previous state-of-the-art models
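The feature-sharing conclusion can be illustrated with a toy model. Everything here is hypothetical (the stand-in backbone and head shapes are not the paper's ResNet/FPN architecture); it only shows the structure the authors advocate: compute shared features once and feed both the instance and stuff branches from them.

```python
import numpy as np

rng = np.random.default_rng(0)

def backbone(image):
    """Stand-in for the shared ResNet+FPN features (toy op only)."""
    return image.mean(axis=-1, keepdims=True)   # collapse channels

def instance_head(feat, w_inst):
    return feat @ w_inst    # per-pixel instance logits

def stuff_head(feat, w_stuff):
    return feat @ w_stuff   # per-pixel stuff logits

image = rng.standard_normal((8, 8, 3))
feat = backbone(image)                                  # computed once ...
inst = instance_head(feat, rng.standard_normal((1, 5)))
stuff = stuff_head(feat, rng.standard_normal((1, 7)))   # ... consumed by both heads
```

The alternative the paper argues against would run two separate backbones over the same image, roughly doubling computation without improving accuracy.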
    Tables
    • Table 1: Loss balance between instance segmentation and stuff segmentation.
    • Table 2: Ablation study on the stuff segmentation network design. Stuff-SC denotes the stuff supervision classes, i.e., predicting only stuff classes, while Stuff-SC plus Object-SC means predicting all classes.
    • Table 3: Results on the MS-COCO panoptic segmentation validation set using our spatial ranking module. "w/ pano-instance GT" denotes using the panoptic segmentation ground truth to generate the instance segmentation ground truth, trained as two separate networks. All results in this table use a ResNet-50 backbone.
    • Table 4: Results on whether to share stuff segmentation and instance segmentation features: sharing features gains 0.7 PQ on the ResNet-50 backbone and 0.7 on ResNet-101. Also, ablation results on different feature-sharing schemes: "res1-res5" shares only the backbone ResNet features, while "+skip-connection" shares both the backbone features and the FPN skip-connection branch.
    • Table 5: Results on the convolution settings of the spatial ranking module; 1 × 1 denotes a convolution kernel size of 1. The results show that a large receptive field helps the spatial ranking module capture more context and achieve better results.
    • Table 6: Results on the COCO 2018 panoptic segmentation challenge test-dev, verifying the effectiveness of our feature-sharing mode and the spatial ranking module. We use ResNet-101 as the base model.
    Related work
    • 2.1. Instance Segmentation

      There are currently two main frameworks for instance segmentation: proposal-based methods and segmentation-based methods. The proposal-based approaches [8, 14, 24, 25, 28, 29, 33] first generate object detection bounding boxes and then perform mask prediction on each box. These methods are closely related to object detection algorithms such as Fast/Faster R-CNN and SPPNet [12, 15, 36]. Under this framework, the overlapping problem arises because distinct instances are predicted independently; that is, pixels may be allocated to the wrong category when covered by multiple masks. The segmentation-based methods use a semantic segmentation network to predict the pixel class, and obtain each instance mask by decoding the object boundary [19] or a custom field [2, 9, 27]; a bottom-up grouping mechanism then generates the object instances. RNN-based methods were leveraged to predict a mask for one instance at a time in [35, 37, 46].
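To make the overlapping problem concrete, here is a minimal sketch of the common confidence-based merging heuristic that proposal-based pipelines rely on (this is the generic heuristic, not this paper's spatial ranking module; the function name is my own): masks are pasted in ascending score order, so the highest-scoring instance claims contested pixels.

```python
import numpy as np

def merge_by_score(masks, scores):
    """Merge independently predicted instance masks into one
    instance-id map (0 = background). Later pastes overwrite
    earlier ones, so sorting by ascending score lets the
    highest-scoring instance win contested pixels."""
    h, w = masks[0].shape
    canvas = np.zeros((h, w), dtype=np.int32)
    order = np.argsort(scores)        # low to high
    for idx in order:
        canvas[masks[idx]] = idx + 1  # instance ids are 1-based
    return canvas
```

Whenever two masks cover the same pixel, this rule silently discards one prediction there, which is exactly the ambiguity the paper's spatial ranking module is designed to resolve in a learned way.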
    Funding
    • This research was supported by the National Key R&D Program of China (No. 2017YFA0700800)
    Reference
    • V. Badrinarayanan, A. Kendall, and R. Cipolla. SegNet: A deep convolutional encoder-decoder architecture for image segmentation. IEEE TPAMI, 2017.
    • M. Bai and R. Urtasun. Deep watershed transform for instance segmentation. In CVPR, 2017.
    • H. Caesar, J. Uijlings, and V. Ferrari. COCO-Stuff: Thing and stuff classes in context. In CVPR, 2018.
    • L.-C. Chen, G. Papandreou, I. Kokkinos, K. Murphy, and A. L. Yuille. DeepLab: Semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected CRFs. IEEE TPAMI, 2018.
    • L.-C. Chen, Y. Yang, J. Wang, W. Xu, and A. L. Yuille. Attention to scale: Scale-aware semantic image segmentation. In CVPR, 2016.
    • L.-C. Chen, Y. Zhu, G. Papandreou, F. Schroff, and H. Adam. Encoder-decoder with atrous separable convolution for semantic image segmentation. In ECCV, 2018.
    • M. Cordts, M. Omran, S. Ramos, T. Rehfeld, M. Enzweiler, R. Benenson, U. Franke, S. Roth, and B. Schiele. The Cityscapes dataset for semantic urban scene understanding. In CVPR, 2016.
    • J. Dai, K. He, and J. Sun. Instance-aware semantic segmentation via multi-task network cascades. In CVPR, 2016.
    • B. De Brabandere, D. Neven, and L. Van Gool. Semantic instance segmentation with a discriminative loss function. arXiv:1708.02551, 2017.
    • D. de Geus, P. Meletis, and G. Dubbelman. Panoptic segmentation with a joint semantic and instance segmentation network. arXiv:1809.02110, 2018.
    • M. Everingham, S. A. Eslami, L. Van Gool, C. K. Williams, J. Winn, and A. Zisserman. The PASCAL visual object classes challenge: A retrospective. IJCV, 2015.
    • R. Girshick, J. Donahue, T. Darrell, and J. Malik. Rich feature hierarchies for accurate object detection and semantic segmentation. In CVPR, 2014.
    • B. Hariharan, P. Arbelaez, R. Girshick, and J. Malik. Hypercolumns for object segmentation and fine-grained localization. In CVPR, 2015.
    • K. He, G. Gkioxari, P. Dollár, and R. Girshick. Mask R-CNN. In ICCV, 2017.
    • K. He, X. Zhang, S. Ren, and J. Sun. Spatial pyramid pooling in deep convolutional networks for visual recognition. In ECCV, 2014.
    • K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In CVPR, 2016.
    • A. Kendall, Y. Gal, and R. Cipolla. Multi-task learning using uncertainty to weigh losses for scene geometry and semantics. In CVPR, 2018.
    • A. Kirillov, K. He, R. Girshick, C. Rother, and P. Dollár. Panoptic segmentation. arXiv:1801.00868, 2018.
    • A. Kirillov, E. Levinkov, B. Andres, B. Savchynskyy, and C. Rother. InstanceCut: From edges to instances with multicut. In CVPR, 2017.
    • I. Kokkinos. UberNet: Training a universal convolutional neural network for low-, mid-, and high-level vision using diverse datasets and limited memory. In CVPR, 2017.
    • A. Krizhevsky, I. Sutskever, and G. E. Hinton. ImageNet classification with deep convolutional neural networks. In NIPS, 2012.
    • Q. Li, A. Arnab, and P. H. Torr. Weakly- and semi-supervised panoptic segmentation. In ECCV, 2018.
    • Y. Li, X. Chen, Z. Zhu, L. Xie, G. Huang, D. Du, and X. Wang. Attention-guided unified network for panoptic segmentation. arXiv:1812.03904, 2018.
    • Y. Li, H. Qi, J. Dai, X. Ji, and Y. Wei. Fully convolutional instance-aware semantic segmentation. In CVPR, 2017.
    • Z. Li, C. Peng, G. Yu, X. Zhang, Y. Deng, and J. Sun. DetNet: Design backbone for object detection. In ECCV, 2018.
    • T.-Y. Lin, P. Dollár, R. B. Girshick, K. He, B. Hariharan, and S. J. Belongie. Feature pyramid networks for object detection. In CVPR, 2017.
    • S. Liu, J. Jia, S. Fidler, and R. Urtasun. SGN: Sequential grouping networks for instance segmentation. In ICCV, 2017.
    • S. Liu, L. Qi, H. Qin, J. Shi, and J. Jia. Path aggregation network for instance segmentation. In CVPR, 2018.
    • S. Liu, X. Qi, J. Shi, H. Zhang, and J. Jia. Multi-scale patch aggregation (MPA) for simultaneous detection and segmentation. In CVPR, 2016.
    • J. Long, E. Shelhamer, and T. Darrell. Fully convolutional networks for semantic segmentation. In CVPR, 2015.
    • I. Misra, A. Shrivastava, A. Gupta, and M. Hebert. Cross-stitch networks for multi-task learning. In CVPR, 2016.
    • G. Neuhold, T. Ollmann, S. R. Bulò, and P. Kontschieder. The Mapillary Vistas dataset for semantic understanding of street scenes. In ICCV, 2017.
    • C. Peng, T. Xiao, Z. Li, Y. Jiang, X. Zhang, K. Jia, G. Yu, and J. Sun. MegDet: A large mini-batch object detector. In CVPR, 2018.
    • C. Peng, X. Zhang, G. Yu, G. Luo, and J. Sun. Large kernel matters—improve semantic segmentation by global convolutional network. In CVPR, 2017.
    • M. Ren and R. S. Zemel. End-to-end instance segmentation with recurrent attention. In CVPR, 2017.
    • S. Ren, K. He, R. Girshick, and J. Sun. Faster R-CNN: Towards real-time object detection with region proposal networks. In NIPS, 2015.
    • B. Romera-Paredes and P. H. S. Torr. Recurrent instance segmentation. In ECCV, 2016.
    • O. Ronneberger, P. Fischer, and T. Brox. U-Net: Convolutional networks for biomedical image segmentation. In MICCAI, 2015.
    • K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition. arXiv:1409.1556, 2014.
    • C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and A. Rabinovich. Going deeper with convolutions. In CVPR, 2015.
    • F. Xia, P. Wang, L.-C. Chen, and A. L. Yuille. Zoom better to see clearer: Human and object parsing with hierarchical auto-zoom net. In ECCV, 2016.
    • C. Yu, J. Wang, C. Peng, C. Gao, G. Yu, and N. Sang. BiSeNet: Bilateral segmentation network for real-time semantic segmentation. In ECCV, 2018.
    • C. Yu, J. Wang, C. Peng, C. Gao, G. Yu, and N. Sang. Learning a discriminative feature network for semantic segmentation. In CVPR, 2018.
    • F. Yu and V. Koltun. Multi-scale context aggregation by dilated convolutions. arXiv:1511.07122, 2015.
    • A. R. Zamir, A. Sax, W. Shen, L. J. Guibas, J. Malik, and S. Savarese. Taskonomy: Disentangling task transfer learning. In CVPR, 2018.
    • Z. Zhang, A. G. Schwing, S. Fidler, and R. Urtasun. Monocular object instance segmentation and depth ordering with CNNs. In CVPR, 2015.
    • Z. Zhang, X. Zhang, C. Peng, X. Xue, and J. Sun. ExFuse: Enhancing feature fusion for semantic segmentation. In ECCV, 2018.
    • H. Zhao, J. Shi, X. Qi, X. Wang, and J. Jia. Pyramid scene parsing network. In CVPR, 2017.
    • B. Zhou, H. Zhao, X. Puig, S. Fidler, A. Barriuso, and A. Torralba. Scene parsing through ADE20K dataset. In CVPR, 2017.