Self-Supervised Scene De-occlusion

CVPR, pp. 3783-3791, 2020.

Keywords: high quality, multiple object, real world, amodal instance segmentation, instance segmentation

Abstract:

Natural scene understanding is a challenging task, particularly when encountering images of multiple objects that are partially occluded. This challenge arises from varying object ordering and positioning. Existing scene understanding paradigms can parse only the visible parts, resulting in incomplete and unstructured scene interpretation.

Introduction
  • Scene understanding is one of the foundations of machine perception. A real-world scene, regardless of its context, often comprises multiple objects of varying ordering and positioning, with one or more object(s) being occluded by other object(s).
  • The advent of advanced deep networks along with large-scale annotated datasets has facilitated many scene understanding tasks, e.g., object detection [4, 5, 6, 7], scene parsing [8, 9, 10], and instance segmentation [11, 12, 13, 14].
  • These tasks mainly concentrate on modal perception, while amodal perception remains rarely explored to date.
  • Depending on the category, orientation, and position of objects, the boundaries of “occludee(s)” are elusive; no simple priors can be applied to recover the invisible boundaries.
Highlights
  • Scene understanding is one of the foundations of machine perception
  • The self-supervised partial completion approximates its supervised counterpart, laying the foundation of our approach. Based on this self-supervised notion, we introduce Partial Completion Networks (PCNets); see the sketch after this list.
  • We evaluate our method in various applications including ordering recovery, amodal completion, amodal instance segmentation, and scene manipulation
  • For the Convex baseline, we compute the convex hull of each modal mask to approximate amodal completion; the object whose mask grows more under the hull is regarded as the occludee.
  • We have proposed a unified scene de-occlusion framework equipped with self-supervised Partial Completion Networks (PCNets) trained without ordering or amodal annotations.
  • The framework is applied progressively to recover occlusion orderings and to perform amodal and content completion.
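What follows is a minimal sketch of how a self-supervised partial-completion training pair could be synthesized from modal masks alone. The function name, the random-shift heuristic, and the exact inputs are illustrative assumptions rather than the authors' verbatim procedure, and the paper's second ("remain the same") regularization case is omitted.

```python
import numpy as np

def make_partial_completion_pair(mask_a, mask_b, rng=None):
    """Synthesize one training pair for mask completion without amodal labels.

    mask_a: (H, W) bool modal mask of the target instance (fully known).
    mask_b: (H, W) bool modal mask of another instance, reused as a
            surrogate occluder (hypothetical choice: randomly shifted
            so that it overlaps the target).
    Returns (input_mask, occluder_mask, target_mask); the network is
    trained to recover target_mask from the other two plus the image.
    """
    rng = rng or np.random.default_rng()
    h, w = mask_a.shape
    # Randomly translate the surrogate occluder so it intersects the target.
    dy = int(rng.integers(-h // 4, h // 4 + 1))
    dx = int(rng.integers(-w // 4, w // 4 + 1))
    occluder = np.roll(mask_b, (dy, dx), axis=(0, 1))
    # Erase the overlap from the target: this is the "partial" input mask.
    input_mask = mask_a & ~occluder
    return input_mask, occluder, mask_a  # target is the known full modal mask
```

Because mask_a is a complete modal mask before the synthetic erasure, the supervision comes for free; no ordering or amodal annotation is ever consulted.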
Methods
  • The authors evaluate the method in various applications including ordering recovery, amodal completion, amodal instance segmentation, and scene manipulation.
  • 1) KINS [18], derived from KITTI [20], is a large-scale traffic dataset with annotated modal and amodal masks of instances.
  • PCNets are trained on the training split (7,474 images, 95,311 instances) with modal annotations.
  • 2) COCOA [17] is a subset of COCO2014 [21], annotated with pairwise ordering, modal, and amodal masks.
  • The authors train PCNets on the training split (2,500 images, 22,163 instances) using modal annotations.
Results
  • For the Convex baseline, the authors compute the convex hull of each modal mask to approximate amodal completion; the object whose mask grows more under the hull is regarded as the occludee.
  • All baselines have been adjusted to achieve their respective best performances.
  • On both benchmarks, the method achieves much higher accuracy than the baselines, comparable to its supervised counterparts.
  • Convex denotes the baseline that approximates the amodal mask by the convex hull of the modal mask; a sketch follows this list.
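Below is a minimal sketch of the Convex baseline as described above, assuming binary masks as boolean NumPy arrays. The function name and the tie-breaking are illustrative; scikit-image's convex_hull_image is one convenient way to rasterize the hull.

```python
from skimage.morphology import convex_hull_image  # scikit-image

def convex_occludee(mask_a, mask_b):
    """Convex baseline for pairwise ordering of two (H, W) boolean masks.

    Each amodal mask is approximated by the convex hull of the modal
    mask; the instance whose mask grows more under the hull is regarded
    as the occludee.
    """
    inc_a = convex_hull_image(mask_a).sum() - mask_a.sum()
    inc_b = convex_hull_image(mask_b).sum() - mask_b.sum()
    return "a" if inc_a > inc_b else "b"  # label of the inferred occludee
```

The hull is only a crude stand-in for the true amodal shape (it over-completes concave objects), consistent with its role here as a baseline.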
Conclusion
  • The authors have proposed a unified scene de-occlusion framework equipped with self-supervised PCNets trained without ordering or amodal annotations.
  • The framework is applied progressively to recover occlusion orderings and to perform amodal and content completion (see the sketch after this list).
  • It can also convert existing modal annotations into amodal annotations.
  • Quantitative results show that these converted annotations are as effective as manual ones.
  • The authors' framework enables high-quality occlusion-aware scene manipulation, providing a new dimension for image editing.
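To make the progressive application concrete, here is a high-level sketch of the pipeline under stated assumptions: pcnet_m and pcnet_c stand in for the trained mask- and content-completion networks with hypothetical call signatures (returning boolean masks and RGB content, respectively), and the increment-based pairwise test mirrors the ordering criterion described in the results.

```python
import numpy as np
from scipy.ndimage import binary_dilation

def pairwise_occludee(image, m_i, m_j, pcnet_m):
    """Complete each mask while treating the other as the eraser; the mask
    that grows more is taken as the occludee (increment criterion)."""
    inc_i = pcnet_m(image, m_i, m_j).sum() - m_i.sum()
    inc_j = pcnet_m(image, m_j, m_i).sum() - m_j.sum()
    return "i" if inc_i > inc_j else "j"

def de_occlude(image, modal_masks, pcnet_m, pcnet_c):
    """Sketch of the progressive pipeline: ordering -> amodal -> content."""
    n = len(modal_masks)
    occluders = {i: [] for i in range(n)}
    # 1) Pairwise ordering between instances whose dilated masks touch.
    for i in range(n):
        for j in range(i + 1, n):
            if (binary_dilation(modal_masks[i]) & modal_masks[j]).any():
                if pairwise_occludee(image, modal_masks[i],
                                     modal_masks[j], pcnet_m) == "i":
                    occluders[i].append(j)
                else:
                    occluders[j].append(i)
    amodal, content = [], []
    for i in range(n):
        # 2) Amodal completion conditioned on the union of the occluders.
        eraser = (np.logical_or.reduce([modal_masks[k] for k in occluders[i]])
                  if occluders[i] else np.zeros_like(modal_masks[i]))
        a_mask = pcnet_m(image, modal_masks[i], eraser)
        # 3) Content completion fills RGB in the newly revealed region.
        content.append(pcnet_c(image, a_mask, a_mask & ~modal_masks[i]))
        amodal.append(a_mask)
    return amodal, content
```

For brevity this sketch conditions each instance only on its direct occluders; the paper traverses the recovered order graph so that each instance is completed against all of its ancestors.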
Tables
  • Table 1: Ordering estimation on COCOA validation and KINS testing sets, reported with pairwise accuracy on occluded instance pairs.
  • Table 2: Amodal completion on COCOA validation and KINS testing sets, using ground-truth modal masks.
  • Table 3: Amodal completion on KINS testing set, using predicted modal masks (mAP 52.7%).
  • Table 4: Amodal instance segmentation on KINS testing set. ConvexR means using the predicted order to refine the convex hull. In this experimental setting, all methods detect and segment instances from raw images; hence, modal masks are not used in testing.
Related work
  • Ordering Recovery. In the unsupervised stream, Wu et al. [22] propose to recover ordering by re-composing the scene with object templates; however, they demonstrate the system only on toy data. Tighe et al. [23] build a prior occlusion matrix between classes on the training set and solve a quadratic program to recover the ordering at test time; this inter-class occlusion prior ignores the complexity of realistic scenes. Other works [24, 25] rely on additional depth cues. However, depth is not reliable for occlusion reasoning: there is no depth difference if a piece of paper lies on a table. The assumption that farther objects are occluded by closer ones also does not always hold; for example, in Fig. 2, the plate (#1) is occluded by the coffee cup (#5), although the cup is farther away. In the supervised stream, several works manually annotate occlusion ordering [17, 18] or rely on synthetic data [16] to learn ordering in a fully supervised manner. Another line of work on panoptic segmentation [26, 27] designs end-to-end training procedures to resolve overlapping segments, but does not explicitly recover the full scene ordering.
  • Amodal Instance Segmentation. Modal segmentation, such as semantic segmentation [9, 10] and instance segmentation [11, 12, 13], assigns categorical or object labels to visible pixels; existing modal approaches cannot solve the de-occlusion problem. In contrast, amodal instance segmentation aims at detecting objects and recovering their amodal (complete) masks. Li et al. [28] produce dummy supervision by pasting artificial occluders, but the absence of explicit ordering makes complicated occlusion relationships difficult to handle. Other works take a fully supervised approach, using either manual annotations [17, 18, 19] or synthetic data [16]. As mentioned above, annotating invisible masks manually is costly and inaccurate, and approaches relying on synthetic data face domain-gap issues. On the contrary, our approach converts modal masks into amodal masks in a self-supervised manner; this unique ability facilitates the training of amodal instance segmentation networks without manual amodal annotations.
  • Amodal Completion. Amodal completion differs slightly from amodal instance segmentation: modal masks are given at test time, and the task is to complete them into amodal masks. Previous works typically rely on heuristic assumptions about the invisible boundaries and perform amodal completion with given ordering relationships: Kimia et al. [29] adopt the Euler spiral, Lin et al. [30] use cubic Bezier curves, and Silberman et al. [31] apply curve primitives including straight lines and parabolas. Since these studies require ordering as input, they cannot be adopted directly to solve the de-occlusion problem; moreover, these unsupervised approaches mainly focus on toy examples with simple shapes. Kar et al. [32] use keypoint annotations to align 3D object templates to 2D image objects, so as to generate ground-truth amodal bounding boxes. Ehsani et al. [15] leverage 3D synthetic data to train an end-to-end amodal completion network. Like the unsupervised methods, our framework needs no amodal annotations or any kind of 3D/synthetic data; unlike them, it solves amodal completion in highly cluttered natural scenes, where other unsupervised methods fall short.
Funding
  • Acknowledgement: This work is supported by the SenseTime-NTU Collaboration Project, Collaborative Research grant from SenseTime Group (CUHK Agreement No. TS1610626 & No. TS1712093), Singapore MOE AcRF Tier 1 (2018-T1-002-056), NTU SUG, and NTU NAP.
References
  • [1] Gaetano Kanizsa. Organization in Vision: Essays on Gestalt Perception. Praeger Publishers, 1979.
  • [2] Stephen E. Palmer. Vision Science: Photons to Phenomenology. MIT Press, 1999.
  • [3] Steven Lehar. Gestalt isomorphism and the quantification of spatial perception. Gestalt Theory, 21:122–139, 1999.
  • [4] Pedro F. Felzenszwalb, Ross B. Girshick, and David McAllester. Cascade object detection with deformable part models. In CVPR, pages 2241–2248, 2010.
  • [5] Shaoqing Ren, Kaiming He, Ross Girshick, and Jian Sun. Faster R-CNN: Towards real-time object detection with region proposal networks. In NIPS, 2015.
  • [6] Jiaqi Wang, Kai Chen, Shuo Yang, Chen Change Loy, and Dahua Lin. Region proposal by guided anchoring. In CVPR, 2019.
  • [7] Kai Chen, Jiaqi Wang, Shuo Yang, Xingcheng Zhang, Yuanjun Xiong, Chen Change Loy, and Dahua Lin. Optimizing video object detection via a scale-time lattice. In CVPR, 2018.
  • [8] Ziwei Liu, Xiaoxiao Li, Ping Luo, Chen Change Loy, and Xiaoou Tang. Semantic image segmentation via deep parsing network. In ICCV, 2015.
  • [9] Liang-Chieh Chen, George Papandreou, Iasonas Kokkinos, Kevin Murphy, and Alan L. Yuille. DeepLab: Semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected CRFs. PAMI, 40(4):834–848, 2017.
  • [10] Hengshuang Zhao, Jianping Shi, Xiaojuan Qi, Xiaogang Wang, and Jiaya Jia. Pyramid scene parsing network. In CVPR, pages 2881–2890, 2017.
  • [11] Jifeng Dai, Kaiming He, Yi Li, Shaoqing Ren, and Jian Sun. Instance-sensitive fully convolutional networks. In ECCV, pages 534–549, 2016.
  • [12] Kaiming He, Georgia Gkioxari, Piotr Dollár, and Ross Girshick. Mask R-CNN. In ICCV, 2017.
  • [13] Kai Chen, Jiangmiao Pang, Jiaqi Wang, Yu Xiong, Xiaoxiao Li, Shuyang Sun, Wansen Feng, Ziwei Liu, Jianping Shi, Wanli Ouyang, et al. Hybrid task cascade for instance segmentation. In CVPR, pages 4974–4983, 2019.
  • [14] Jiaqi Wang, Kai Chen, Rui Xu, Ziwei Liu, Chen Change Loy, and Dahua Lin. CARAFE: Content-aware reassembly of features. In ICCV, 2019.
  • [15] Kiana Ehsani, Roozbeh Mottaghi, and Ali Farhadi. SeGAN: Segmenting and generating the invisible. In CVPR, pages 6144–6153, 2018.
  • [16] Yuan-Ting Hu, Hong-Shuo Chen, Kexin Hui, Jia-Bin Huang, and Alexander G. Schwing. SAIL-VOS: Semantic amodal instance level video object segmentation — a synthetic dataset and baselines. In CVPR, pages 3105–3115, 2019.
  • [17] Yan Zhu, Yuandong Tian, Dimitris Metaxas, and Piotr Dollár. Semantic amodal segmentation. In CVPR, pages 1464–1472, 2017.
  • [18] Lu Qi, Li Jiang, Shu Liu, Xiaoyong Shen, and Jiaya Jia. Amodal instance segmentation with KINS dataset. In CVPR, pages 3014–3023, 2019.
  • [19] Patrick Follmann, Rebecca König, Philipp Härtinger, Michael Klostermann, and Tobias Böttger. Learning to see the invisible: End-to-end trainable amodal instance segmentation. In WACV, pages 1328–1336, 2019.
  • [20] Andreas Geiger, Philip Lenz, Christoph Stiller, and Raquel Urtasun. Vision meets robotics: The KITTI dataset. The International Journal of Robotics Research, 32(11):1231–1237, 2013.
  • [21] Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, and Piotr Dollár. Microsoft COCO: Common objects in context. In ECCV, 2014.
  • [22] Jiajun Wu, Joshua B. Tenenbaum, and Pushmeet Kohli. Neural scene de-rendering. In CVPR, 2017.
  • [23] Joseph Tighe, Marc Niethammer, and Svetlana Lazebnik. Scene parsing with object instances and occlusion ordering. In CVPR, pages 3748–3755, 2014.
  • [24] Derek Hoiem, Andrew N. Stein, Alexei A. Efros, and Martial Hebert. Recovering occlusion boundaries from a single image. In ICCV, 2007.
  • [25] Pulak Purkait, Christopher Zach, and Ian Reid. Seeing behind things: Extending semantic segmentation to occluded regions. arXiv preprint arXiv:1906.02885, 2019.
  • [26] Huanyu Liu, Chao Peng, Changqian Yu, Jingbo Wang, Xu Liu, Gang Yu, and Wei Jiang. An end-to-end network for panoptic segmentation. In CVPR, pages 6172–6181, 2019.
  • [27] Justin Lazarow, Kwonjoon Lee, and Zhuowen Tu. Learning instance occlusion for panoptic segmentation. arXiv preprint arXiv:1906.05896, 2019.
  • [28] Ke Li and Jitendra Malik. Amodal instance segmentation. In ECCV, pages 677–693, 2016.
  • [29] Benjamin B. Kimia, Ilana Frankel, and Ana-Maria Popescu. Euler spiral for shape completion. IJCV, 54(1–3):159–182, 2003.
  • [30] Hongwei Lin, Zihao Wang, Panpan Feng, Xingjiang Lu, and Jinhui Yu. A computational model of topological and geometric recovery for visual curve completion. Computational Visual Media, 2(4):329–342, 2016.
  • [31] Nathan Silberman, Lior Shapira, Ran Gal, and Pushmeet Kohli. A contour completion model for augmenting surface reconstructions. In ECCV, 2014.
  • [32] Abhishek Kar, Shubham Tulsiani, João Carreira, and Jitendra Malik. Amodal completion and size constancy in natural scenes. In ICCV, pages 127–135, 2015.
  • [33] Olaf Ronneberger, Philipp Fischer, and Thomas Brox. U-Net: Convolutional networks for biomedical image segmentation. In MICCAI, pages 234–241, 2015.
  • [34] Guilin Liu, Fitsum A. Reda, Kevin J. Shih, Ting-Chun Wang, Andrew Tao, and Bryan Catanzaro. Image inpainting for irregular holes using partial convolutions. In ECCV, 2018.