PointRend: Image Segmentation as Rendering

CVPR, 2020.

Keywords:
segmentation task, instance segmentation, feature pyramid network, semantic segmentation, fully convolutional networks

Abstract:

We present a new method for efficient high-quality image segmentation of objects and scenes. By analogizing classical computer graphics methods for efficient rendering with over- and undersampling challenges faced in pixel labeling tasks, we develop a unique perspective of image segmentation as a rendering problem. From this vantage, we...

Introduction
  • Image segmentation tasks involve mapping pixels sampled on a regular grid to a label map, or a set of label maps, on the same grid.
  • A PointRend module accepts one or more typical CNN feature maps of C channels, f ∈ ℝ^(C×H×W), each defined over a regular grid, and outputs high-resolution predictions p(x′ᵢ, y′ᵢ) for the K class labels, p ∈ ℝ^(K×H′×W′), over a finer regular grid (a point-sampling sketch follows this list).
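Because the selected points have real-valued coordinates rather than integer pixel indices, point-wise features are extracted by bilinear interpolation of the regular-grid feature map. Below is a minimal sketch of such a sampling helper in PyTorch; the name `point_sample` and the exact tensor conventions are illustrative assumptions, not the released implementation.

```python
# Hedged sketch: bilinear sampling of grid features at real-valued points.
# Assumes PyTorch; coordinates live in [0, 1] x [0, 1], (x, y) order.
import torch
import torch.nn.functional as F

def point_sample(features: torch.Tensor, points: torch.Tensor) -> torch.Tensor:
    """features: (B, C, H, W) regular-grid feature map f.
    points:   (B, P, 2) real-valued point coordinates in [0, 1].
    returns:  (B, C, P) bilinearly interpolated point-wise features.
    """
    # grid_sample expects a (B, H_out, W_out, 2) grid in [-1, 1];
    # treat the P points as a 1 x P "image" of sample locations.
    grid = 2.0 * points.unsqueeze(1) - 1.0                    # (B, 1, P, 2)
    out = F.grid_sample(features, grid, mode="bilinear",
                        align_corners=False)                  # (B, C, 1, P)
    return out.squeeze(2)                                     # (B, C, P)
```

The same helper serves both roles described later: sampling fine-grained features from f and sampling the K-dimensional coarse-prediction vector at each point.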
Highlights
  • The modern tools of choice for these tasks are built on convolutional neural networks (CNNs) [27, 26]
  • Convolutional neural networks for image segmentation typically operate on regular grids: the input image is a regular grid of pixels, their hidden representations are feature vectors on a regular grid, and their outputs are label maps on a regular grid
  • The label maps predicted by these networks should be mostly smooth, i.e., neighboring pixels often take the same label, because high-frequency regions are restricted to the sparse boundaries between objects
Results
  • A PointRend module consists of three main components: (i) a point selection strategy that chooses a small number of real-valued points at which to make predictions, avoiding excessive computation over all pixels of the high-resolution output grid; (ii) a point-wise feature representation extracted at each selected point; and (iii) a point head that predicts a label from each point-wise feature.
  • It computes masks in a coarse-to-fine fashion by making predictions over a set of selected points.
  • PointRend upsamples its previously predicted segmentation using bilinear interpolation and selects the N most uncertain points on this denser grid.
  • Predictions and loss functions are only computed on the N sampled points, which is simpler and more efficient than backpropagation through subdivision steps.
  • To allow PointRend to render fine segmentation details, the authors extract a feature vector at each sampled point from CNN feature maps.
  • The coarse prediction can be, for example, the output of a lightweight 7×7 resolution mask head in Mask R-CNN.
  • Given the point-wise feature representation at each selected point, PointRend makes point-wise segmentation predictions using a simple multi-layer perceptron (MLP).
  • A K-dimensional feature vector is extracted from the coarse prediction head’s output using bilinear interpolation.
  • The authors found that training as a cascade does not improve the baseline Mask R-CNN, but PointRend can benefit from it by sampling points inside more accurate boxes, slightly improving overall performance (∼0.2% AP, absolute).
  • Subdivision inference allows PointRend to yield a high-resolution 224×224 prediction using more than 30× less compute (FLOPs) and memory than the default 4× conv head needs to output the same resolution; see Table 2 and the sketch after this list.
  • Table 3 shows PointRend subdivision inference with different output resolutions and numbers of points selected at each subdivision step.
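To make the loop above concrete, here is a hedged sketch of subdivision inference: each step bilinearly upsamples the previous prediction to a 2× denser grid, takes the N most uncertain points, and re-predicts only those with the point head. `point_head` (an MLP applied per point), the uncertainty measure (negative gap between the two highest class scores), and the defaults N = 28² and 224×224 output are assumptions for illustration, consistent with the text; `point_sample` is the helper sketched earlier.

```python
# Hedged sketch of PointRend subdivision inference (not the released code).
import torch
import torch.nn.functional as F

@torch.no_grad()
def subdivision_inference(fine_features, coarse_logits, point_head,
                          num_points=28 * 28, output_size=224):
    """fine_features: (B, C, H, W); coarse_logits: (B, K, h, w), e.g. 7x7,
    with K >= 2 class channels. Returns (B, K, output_size, output_size)."""
    logits = coarse_logits
    while logits.shape[-1] < output_size:
        # 1) Upsample the previous prediction to a 2x denser grid.
        logits = F.interpolate(logits, scale_factor=2, mode="bilinear",
                               align_corners=False)
        B, K, H, W = logits.shape

        # 2) Uncertainty: negative gap between the two highest class scores
        #    (largest where the prediction is most ambiguous, i.e. boundaries).
        top2 = logits.topk(2, dim=1).values                   # (B, 2, H, W)
        uncertainty = -(top2[:, 0] - top2[:, 1])              # (B, H, W)
        n = min(num_points, H * W)
        idx = uncertainty.view(B, -1).topk(n, dim=1).indices  # (B, n)

        # 3) Flat indices -> normalized (x, y) coordinates of cell centers.
        xs = ((idx % W).float() + 0.5) / W
        ys = (torch.div(idx, W, rounding_mode="floor").float() + 0.5) / H
        points = torch.stack([xs, ys], dim=2)                 # (B, n, 2)

        # 4) Re-predict only the selected points: concatenate fine-grained
        #    features with the coarse prediction at each point, run the MLP,
        #    and scatter the refined logits back into the dense map.
        feats = point_sample(fine_features, points)           # (B, C, n)
        coarse = point_sample(logits, points)                 # (B, K, n)
        refined = point_head(torch.cat([feats, coarse], dim=1))  # (B, K, n)
        logits = logits.view(B, K, -1).scatter(
            2, idx.unsqueeze(1).expand(-1, K, -1), refined).view(B, K, H, W)
    return logits
```

Starting from a 7×7 coarse prediction, five such steps reach 224×224 while evaluating the point head at only N points per step, which is where the >30× FLOPs/memory saving reported in Table 2 comes from.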
Conclusion
  • The authors demonstrate that PointRend can benefit two semantic segmentation models: DeepLabV3 [5], which uses dilated convolutions to make predictions on a denser grid, and SemanticFPN [24], a simple encoder-decoder architecture.
  • Coarse prediction features come from the output of the semantic segmentation model.
  • During training, the authors sample as many points as there are on a stride-16 feature map of the input (2304 for DeepLabV3 and 2048 for SemanticFPN); a training-time sampling sketch follows this list.
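At training time no subdivision is used: N points are picked with a mildly biased strategy (cf. Table 4), and the loss is computed only there. A hedged sketch of that strategy follows: draw k·N uniform candidates, keep the β·N most uncertain under the coarse prediction, and fill the remaining (1 − β)·N with fresh uniform samples. The defaults k = 3 and β = 0.75 are assumed here for the mildly biased setting, and `point_sample` is the helper sketched earlier; names are illustrative.

```python
# Hedged sketch of training-time point selection (mildly biased toward
# uncertain regions, cf. Table 4); the loss is computed only at these points.
import torch

def sample_training_points(coarse_logits, N, k=3, beta=0.75):
    """coarse_logits: (B, K, h, w), K >= 2. Returns (B, N, 2) points in [0,1]^2."""
    B = coarse_logits.shape[0]
    device = coarse_logits.device

    # Over-generate k*N candidate points uniformly at random.
    candidates = torch.rand(B, k * N, 2, device=device)

    # Keep the beta*N candidates where the interpolated coarse prediction is
    # least certain (negative top-2 class-score gap, as at inference time).
    logits_at = point_sample(coarse_logits, candidates)          # (B, K, kN)
    top2 = logits_at.topk(2, dim=1).values
    uncertainty = -(top2[:, 0] - top2[:, 1])                     # (B, kN)
    n_biased = int(beta * N)
    idx = uncertainty.topk(n_biased, dim=1).indices              # (B, beta*N)
    biased = torch.gather(candidates, 1,
                          idx.unsqueeze(2).expand(-1, -1, 2))    # (B, beta*N, 2)

    # Cover the rest with fresh uniform samples; heavy bias hurts (Table 4).
    uniform = torch.rand(B, N - n_biased, 2, device=device)
    return torch.cat([biased, uniform], dim=1)                   # (B, N, 2)
```

Ground-truth labels for the point-wise loss are then read off at the same N locations, so no dense high-resolution prediction is ever materialized during training, consistent with the earlier bullet on computing predictions and losses only on the N sampled points.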
Tables
  • Table 1: PointRend vs. the default 4× conv mask head for Mask R-CNN [19]. Mask AP is reported. AP⋆ is COCO mask AP evaluated against the higher-quality LVIS annotations [16] (see text for details). A ResNet-50-FPN backbone is used for both COCO and Cityscapes models. PointRend outperforms the standard 4× conv mask head both quantitatively and qualitatively. Higher output resolution leads to more detailed predictions; see Fig. 2 and Fig. 6.
  • Table 2: FLOPs (multiply-adds) and activation counts for a 224×224 output-resolution mask. PointRend's efficient subdivision makes 224×224 output feasible, in contrast to the standard 4× conv mask head modified to use an RoIAlign size of 112×112.
  • Table 3: Subdivision inference parameters. Higher output resolution improves AP. Although improvements saturate quickly (at underlined values) with the number of points sampled at each subdivision step, qualitative results may continue to improve for complex objects. AP⋆ is COCO mask AP evaluated against the higher-quality LVIS annotations [16] (see text for details).
  • Table 4: Training-time point selection strategies with 14² points per box. Mildly biasing sampling towards uncertain regions performs best. Heavily biased sampling performs even worse than uniform or regular-grid sampling, indicating the importance of coverage. AP⋆ is COCO mask AP evaluated against the higher-quality LVIS annotations [16] (see text for details).
  • Table 5: Larger models and a longer 3× schedule [18]. PointRend benefits from more advanced models and the longer training. The gap between PointRend and the default mask head in Mask R-CNN holds. AP⋆ is COCO mask AP evaluated against the higher-quality LVIS annotations [16] (see text for details).
  • Table 6: DeepLabV3 with PointRend for Cityscapes semantic segmentation outperforms the DeepLabV3 baseline. Dilating the res4 stage during inference yields a larger, more accurate prediction, but at much higher computational and memory cost; it is still inferior to using PointRend.
  • Table 7: SemanticFPN with PointRend for Cityscapes semantic segmentation outperforms the SemanticFPN baseline.
Related work
  • Rendering algorithms in computer graphics output a regular grid of pixels. However, they usually compute these pixel values over a non-uniform set of points. Efficient procedures like subdivision [48] and adaptive sampling [38, 42] refine a coarse rasterization in areas where pixel values have larger variance. Ray-tracing renderers often use oversampling [50], a technique that samples some points more densely than the output grid to avoid aliasing effects. Here, we apply classical subdivision to image segmentation.

    Non-uniform grid representations. Computation on regular grids is the dominant paradigm for 2D image analysis, but this is not the case for other vision tasks. In 3D shape recognition, large 3D grids are infeasible due to cubic scaling. Most CNN-based approaches do not go beyond coarse 64×64×64 grids [12, 8]. Instead, recent works consider more efficient non-uniform representations such as meshes [47, 14], signed distance functions [37], and octrees [46]. Similar to a signed distance function, PointRend can compute segmentation values at any point.
Funding
  • Presents a new method for efficient high-quality image segmentation of objects and scenes
  • Develops a unique perspective of image segmentation as a rendering problem
  • Presents the PointRend neural network module: a module that performs point-based segmentation predictions at adaptively selected locations based on an iterative subdivision algorithm
  • Introduces the PointRend module that makes predictions at adaptively sampled points on the image using a new point-based feature representation
Reference
  • [1] Anurag Arnab and Philip HS Torr. Pixelwise instance segmentation with a dynamically instantiated network. In CVPR, 2017.
  • [2] Samuel Rota Bulo, Lorenzo Porzi, and Peter Kontschieder. In-place activated batchnorm for memory-optimized training of DNNs. In CVPR, 2018.
  • [3] Kai Chen, Jiangmiao Pang, Jiaqi Wang, Yu Xiong, Xiaoxiao Li, Shuyang Sun, Wansen Feng, Ziwei Liu, Jianping Shi, Wanli Ouyang, et al. Hybrid task cascade for instance segmentation. In CVPR, 2019.
  • [4] Liang-Chieh Chen, George Papandreou, Iasonas Kokkinos, Kevin Murphy, and Alan L Yuille. DeepLab: Semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected CRFs. PAMI, 2018.
  • [5] Liang-Chieh Chen, George Papandreou, Florian Schroff, and Hartwig Adam. Rethinking atrous convolution for semantic image segmentation. arXiv:1706.05587, 2017.
  • [6] Liang-Chieh Chen, Yukun Zhu, George Papandreou, Florian Schroff, and Hartwig Adam. Encoder-decoder with atrous separable convolution for semantic image segmentation. In ECCV, 2018.
  • [7] Xinlei Chen, Ross Girshick, Kaiming He, and Piotr Dollar. TensorMask: A foundation for dense object segmentation. In ICCV, 2019.
  • [8] Christopher B Choy, Danfei Xu, JunYoung Gwak, Kevin Chen, and Silvio Savarese. 3D-R2N2: A unified approach for single and multi-view 3D object reconstruction. In ECCV, 2016.
  • [9] Marius Cordts, Mohamed Omran, Sebastian Ramos, Timo Rehfeld, Markus Enzweiler, Rodrigo Benenson, Uwe Franke, Stefan Roth, and Bernt Schiele. The Cityscapes dataset for semantic urban scene understanding. In CVPR, 2016.
  • [10] Jifeng Dai, Haozhi Qi, Yuwen Xiong, Yi Li, Guodong Zhang, Han Hu, and Yichen Wei. Deformable convolutional networks. In ICCV, 2017.
  • [11] Mark Everingham, SM Ali Eslami, Luc Van Gool, Christopher KI Williams, John Winn, and Andrew Zisserman. The PASCAL visual object classes challenge: A retrospective. IJCV, 2015.
  • [12] Rohit Girdhar, David F Fouhey, Mikel Rodriguez, and Abhinav Gupta. Learning a predictable and generative vector representation for objects. In ECCV, 2016.
  • [13] Ross Girshick. Fast R-CNN. In ICCV, 2015.
  • [14] Georgia Gkioxari, Jitendra Malik, and Justin Johnson. Mesh R-CNN. In ICCV, 2019.
  • [15] Priya Goyal, Piotr Dollar, Ross Girshick, Pieter Noordhuis, Lukasz Wesolowski, Aapo Kyrola, Andrew Tulloch, Yangqing Jia, and Kaiming He. Accurate, large minibatch SGD: Training ImageNet in 1 hour. arXiv:1706.02677, 2017.
  • [16] Agrim Gupta, Piotr Dollar, and Ross Girshick. LVIS: A dataset for large vocabulary instance segmentation. In ICCV, 2019.
  • [17] Bharath Hariharan, Pablo Arbelaez, Ross Girshick, and Jitendra Malik. Hypercolumns for object segmentation and fine-grained localization. In CVPR, 2015.
  • [18] Kaiming He, Ross Girshick, and Piotr Dollar. Rethinking ImageNet pre-training. In ICCV, 2019.
  • [19] Kaiming He, Georgia Gkioxari, Piotr Dollar, and Ross Girshick. Mask R-CNN. In ICCV, 2017.
  • [20] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In CVPR, 2016.
  • [21] Sergey Ioffe and Christian Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. In ICML, 2015.
  • [22] Max Jaderberg, Karen Simonyan, Andrew Zisserman, and Koray Kavukcuoglu. Spatial transformer networks. In NIPS, 2015.
  • [23] Thomas N Kipf and Max Welling. Semi-supervised classification with graph convolutional networks. In ICLR, 2017.
  • [24] Alexander Kirillov, Ross Girshick, Kaiming He, and Piotr Dollar. Panoptic feature pyramid networks. In CVPR, 2019.
  • [25] Alexander Kirillov, Evgeny Levinkov, Bjoern Andres, Bogdan Savchynskyy, and Carsten Rother. InstanceCut: From edges to instances with multicut. In CVPR, 2017.
  • [26] Alex Krizhevsky, Ilya Sutskever, and Geoff Hinton. ImageNet classification with deep convolutional neural networks. In NIPS, 2012.
  • [27] Yann LeCun, Bernhard Boser, John S Denker, Donnie Henderson, Richard E Howard, Wayne Hubbard, and Lawrence D Jackel. Backpropagation applied to handwritten zip code recognition. Neural Computation, 1989.
  • [28] Tsung-Yi Lin, Piotr Dollar, Ross Girshick, Kaiming He, Bharath Hariharan, and Serge Belongie. Feature pyramid networks for object detection. In CVPR, 2017.
  • [29] Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollar, and C Lawrence Zitnick. Microsoft COCO: Common objects in context. In ECCV, 2014.
  • [30] Chenxi Liu, Liang-Chieh Chen, Florian Schroff, Hartwig Adam, Wei Hua, Alan L Yuille, and Li Fei-Fei. Auto-DeepLab: Hierarchical neural architecture search for semantic image segmentation. In CVPR, 2019.
  • [31] Shu Liu, Jiaya Jia, Sanja Fidler, and Raquel Urtasun. SGN: Sequential grouping networks for instance segmentation. In CVPR, 2017.
  • [32] Shu Liu, Lu Qi, Haifang Qin, Jianping Shi, and Jiaya Jia. Path aggregation network for instance segmentation. In CVPR, 2018.
  • [33] Wei Liu, Dragomir Anguelov, Dumitru Erhan, Christian Szegedy, Scott Reed, Cheng-Yang Fu, and Alexander C Berg. SSD: Single shot multibox detector. In ECCV, 2016.
  • [34] Wei Liu, Andrew Rabinovich, and Alexander C Berg. ParseNet: Looking wider to see better. arXiv:1506.04579, 2015.
  • [35] Jonathan Long, Evan Shelhamer, and Trevor Darrell. Fully convolutional networks for semantic segmentation. In CVPR, 2015.
  • [36] Dmitrii Marin, Zijian He, Peter Vajda, Priyam Chatterjee, Sam Tsai, Fei Yang, and Yuri Boykov. Efficient segmentation: Learning downsampling near semantic boundaries. In ICCV, 2019.
  • [37] Lars Mescheder, Michael Oechsle, Michael Niemeyer, Sebastian Nowozin, and Andreas Geiger. Occupancy networks: Learning 3D reconstruction in function space. In CVPR, 2019.
  • [38] Don P Mitchell. Generating antialiased images at low sampling densities. ACM SIGGRAPH Computer Graphics, 1987.
  • [39] Vinod Nair and Geoffrey E Hinton. Rectified linear units improve restricted Boltzmann machines. In ICML, 2010.
  • [40] Gerhard Neuhold, Tobias Ollmann, Samuel Rota Bulo, and Peter Kontschieder. The Mapillary Vistas dataset for semantic understanding of street scenes. In CVPR, 2017.
  • [41] Paphio. Jo-Wilfried Tsonga [19]. CC BY-NC-SA 2.0. https://www.flickr.com/photos/paphio/2855627782/, 2008.
  • [42] Matt Pharr, Wenzel Jakob, and Greg Humphreys. Physically based rendering: From theory to implementation, chapter 7. Morgan Kaufmann, 2016.
  • [43] Charles R Qi, Hao Su, Kaichun Mo, and Leonidas J Guibas. PointNet: Deep learning on point sets for 3D classification and segmentation. In CVPR, 2017.
  • [44] Olaf Ronneberger, Philipp Fischer, and Thomas Brox. U-Net: Convolutional networks for biomedical image segmentation. In MICCAI, 2015.
  • [45] Ke Sun, Yang Zhao, Borui Jiang, Tianheng Cheng, Bin Xiao, Dong Liu, Yadong Mu, Xinggang Wang, Wenyu Liu, and Jingdong Wang. High-resolution representations for labeling pixels and regions. arXiv:1904.04514, 2019.
  • [46] Maxim Tatarchenko, Alexey Dosovitskiy, and Thomas Brox. Octree generating networks: Efficient convolutional architectures for high-resolution 3D outputs. In ICCV, 2017.
  • [47] Nanyang Wang, Yinda Zhang, Zhuwen Li, Yanwei Fu, Wei Liu, and Yu-Gang Jiang. Pixel2Mesh: Generating 3D mesh models from single RGB images. In ECCV, 2018.
  • [48] Turner Whitted. An improved illumination model for shaded display. In ACM SIGGRAPH Computer Graphics, 1979.
  • [49] Yuxin Wu, Alexander Kirillov, Francisco Massa, Wan-Yen Lo, and Ross Girshick. Detectron2. https://github.com/facebookresearch/detectron2, 2019.
  • [50] Kun Zhou, Qiming Hou, Rui Wang, and Baining Guo. Real-time kd-tree construction on graphics hardware. In ACM Transactions on Graphics (TOG), 2008.