Weakly- and Semi-Supervised Learning of a DCNN for Semantic Image Segmentation

CoRR, 2015.

Keywords: image segmentation model, segmentation benchmark, stochastic gradient descent, field of view, image level
Abstract:

Deep convolutional neural networks (DCNNs) trained on a large number of images with strong pixel-level annotations have recently significantly pushed the state-of-the-art in semantic image segmentation. We study the more challenging problem of learning DCNNs for semantic image segmentation from either (1) weakly annotated training data such...

Introduction
  • Semantic image segmentation refers to the problem of assigning a semantic label to every pixel in the image.
  • The authors work with the DeepLab-CRF approach of [5, 41]
  • This combines a DCNN with a fully connected Conditional Random Field (CRF) [19], in order to get high resolution segmentations.
  • This model achieves state-of-the-art results on the challenging PASCAL VOC segmentation benchmark [13], delivering a mean intersection-over-union (IOU) score exceeding 70%.
  • The authors achieve performance up to 69.0% with the proposed weakly- and semi-supervised techniques, demonstrating their effectiveness
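The mean IOU score quoted above averages, over classes, the overlap between predicted and ground-truth masks. A minimal sketch of the metric (the function name and skip-absent-classes convention are illustrative, not taken from the benchmark's evaluation code):

```python
import numpy as np

def mean_iou(pred, gt, num_classes):
    """Mean intersection-over-union over classes present in pred or gt."""
    ious = []
    for c in range(num_classes):
        inter = np.logical_and(pred == c, gt == c).sum()
        union = np.logical_or(pred == c, gt == c).sum()
        if union > 0:                      # skip classes absent from both maps
            ious.append(inter / union)
    return float(np.mean(ious))

# toy 2x2 label maps over classes {0, 1}:
# class 0 -> IOU 1/2, class 1 -> IOU 2/3, mean = 7/12
pred = np.array([[0, 1], [1, 1]])
gt   = np.array([[0, 1], [0, 1]])
print(mean_iou(pred, gt, num_classes=2))   # ≈ 0.5833
```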
Highlights
  • Semantic image segmentation refers to the problem of assigning a semantic label to every pixel in the image
  • Various approaches have been tried over the years, but according to the results on the challenging Pascal VOC 2012 segmentation benchmark, the best performing methods all use some kind of Deep Convolutional Neural Network (DCNN) [2, 5, 8, 14, 25, 27, 41]
  • We develop new methods for training DCNN image segmentation models from weak annotations, either alone or in combination with a small number of strong annotations
  • We develop novel online Expectation-Maximization (EM) methods for training DCNN semantic segmentation models from weakly annotated data
  • For simplicity, we focus on methods for training the DCNN parameters from weak labels, using the Conditional Random Field only at test time
  • We achieve performance up to 69.0%, demonstrating the effectiveness of the proposed techniques
  • Qualitative segmentation results: in Fig. 6 we provide visual comparisons of the results obtained by the DeepLab-CRF model learned with some of the proposed training methods
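The online EM scheme sketched in the highlights alternates between estimating latent pixel labels (E-step) and an SGD update on those labels (M-step). A minimal sketch of an EM-Fixed-style E-step, assuming per-pixel class scores from the DCNN; the bias values and function name are illustrative, not the paper's exact implementation:

```python
import numpy as np

def em_fixed_estep(scores, image_labels, b_fg=5.0, b_bg=3.0):
    """E-step: boost the scores of classes known to appear in the image
    (image-level weak labels), then take the per-pixel argmax as the
    estimated latent segmentation. Class 0 is background.
    scores: (H, W, C) array of log-scores from the DCNN."""
    biased = scores.copy()
    biased[..., 0] += b_bg                  # background is always allowed
    for c in image_labels:                  # classes present at image level
        biased[..., c] += b_fg
    # classes absent from the weak label set keep their raw (lower) scores,
    # so the argmax effectively restricts labels to {background} + image_labels
    return biased.argmax(axis=-1)
```

The M-step would then treat this latent segmentation as if it were ground truth and take a stochastic gradient step on the usual pixel-wise cross-entropy loss.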
Methods
  • The authors build on the DeepLab model for semantic image segmentation proposed in [5]
  • This uses a DCNN to predict the label distribution per pixel, followed by a fully-connected (dense) CRF [19] to smooth the predictions while preserving image edges.
  • See [41], which uses joint PASCAL and MS-COCO training, and further improves performance (74.7%) by end-to-end learning of the DCNN and CRF parameters
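The DCNN-then-CRF pipeline described above can be illustrated with a heavily simplified mean-field inference step for a fully-connected Potts CRF. This is a toy sketch under stated assumptions: it uses only a spatial Gaussian kernel with a single hand-set weight, whereas the dense CRF of [19] adds a bilateral colour kernel, learned weights, and an efficient filtering-based implementation:

```python
import numpy as np

def meanfield_smooth(unary_probs, coords, sigma=2.0, w=1.0, n_iters=5):
    """Simplified mean-field inference for a fully-connected Potts CRF.
    unary_probs: (N, C) softmax outputs of the DCNN at N pixels;
    coords: (N, 2) pixel positions."""
    # pairwise spatial Gaussian kernel, computed brute-force (toy scale only)
    d2 = ((coords[:, None, :] - coords[None, :, :]) ** 2).sum(-1)
    K = np.exp(-d2 / (2 * sigma ** 2))
    np.fill_diagonal(K, 0.0)                # no self-interaction
    log_unary = np.log(unary_probs + 1e-12)
    Q = unary_probs.copy()
    for _ in range(n_iters):
        msg = K @ Q                         # aggregated neighbour beliefs, (N, C)
        # Potts compatibility: each label is penalised by the kernel-weighted
        # mass its neighbours place on *other* labels
        logits = log_unary - w * (msg.sum(1, keepdims=True) - msg)
        Q = np.exp(logits - logits.max(1, keepdims=True))
        Q /= Q.sum(1, keepdims=True)        # renormalise per pixel
    return Q
```

On a row of three pixels where the middle pixel's unary marginally prefers the "wrong" class, the two confident neighbours pull it back: exactly the smoothing behaviour the model relies on at test time.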
Results
  • In Fig. 6 the authors provide visual comparisons of the results obtained by the DeepLab-CRF model learned with some of the proposed training methods.
Conclusion
  • The paper has explored the use of weak or partial annotation in training a state-of-the-art semantic image segmentation model.
  • Extensive experiments on the challenging PASCAL VOC 2012 dataset have shown that: (1) Using weak annotation solely at the image level seems insufficient to train a high-quality segmentation model. (2) Using weak bounding-box annotation in conjunction with careful segmentation inference for images in the training set suffices to train a competitive model. (3) Excellent performance is obtained when combining a small number of pixel-level annotated images with a large number of weakly annotated images in a semi-supervised setting, nearly matching the results achieved when all training images have pixel-level annotations. (4) Exploiting extra weak or strong annotations from other datasets can lead to large improvements
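Training from bounding boxes relies on turning box annotations into per-pixel targets. A minimal sketch of the simplest such scheme, the paper's Bbox-Rect baseline (every pixel inside a box takes the box's class, overlaps resolved in favour of the smallest box); the function name and box encoding are illustrative:

```python
import numpy as np

def bbox_rect_targets(h, w, boxes):
    """Bbox-Rect baseline: build a per-pixel label map from box annotations.
    boxes: list of (class_id, x0, y0, x1, y1) with exclusive upper bounds;
    pixels outside all boxes are background (class 0)."""
    labels = np.zeros((h, w), dtype=np.int64)
    # paint larger boxes first so smaller boxes overwrite them on overlap
    for cls, x0, y0, x1, y1 in sorted(
            boxes, key=lambda b: (b[3] - b[1]) * (b[4] - b[2]), reverse=True):
        labels[y0:y1, x0:x1] = cls
    return labels
```

The paper's stronger Bbox-Seg variant replaces this crude fill with a GrabCut-like segmentation [33] inside each box, which is what makes the box-supervised models competitive.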
Tables
  • Table 1: VOC 2012 val performance for varying number of pixel-level (strong) and image-level (weak) annotations (Sec. 4.3)
  • Table 2: VOC 2012 test performance for varying number of pixel-level (strong) and image-level (weak) annotations (Sec. 4.3)
  • Table 3: VOC 2012 val performance for varying number of pixel-level (strong) and bounding box (weak) annotations (Sec. 4.4)
  • Table 4: VOC 2012 test performance for varying number of pixel-level (strong) and bounding box (weak) annotations (Sec. 4.4)
  • Table 5: VOC 2012 val performance using strong annotations for all 10,582 train_aug PASCAL images and a varying number of strong and weak MS-COCO annotations (Sec. 4.5)
  • Table 6: VOC 2012 test performance using PASCAL and MS-COCO annotations
Related work
  • Training segmentation models with only image-level labels has been a challenging problem in the literature [12, 36, 37, 39]. Our work is most related to other recent DCNN models such as [30, 31], who also study the weakly supervised setting. They both develop MIL-based algorithms for the problem. In contrast, our model employs an EM algorithm, which similarly to [26] takes into account the weak labels when inferring the latent image segmentations. Moreover, [31] proposed to smooth the prediction results by region proposal algorithms, e.g., CPMC [3] and MCG [1], learned on pixel-segmented images. Neither [30, 31] cover the semi-supervised setting.
Funding
  • This work was partly supported by ARO 62250-CS, and NIH 5R01EY022247-03
  • We also gratefully acknowledge the support of NVIDIA Corporation with the donation of GPUs used for this research
References
  • [1] P. Arbelaez, J. Pont-Tuset, J. T. Barron, F. Marques, and J. Malik. Multiscale combinatorial grouping. In CVPR, 2014.
  • [2] S. Bell, P. Upchurch, N. Snavely, and K. Bala. Material recognition in the wild with the materials in context database. arXiv:1412.0623, 2014.
  • [3] J. Carreira and C. Sminchisescu. CPMC: Automatic object segmentation using constrained parametric min-cuts. PAMI, 34(7):1312–1328, 2012.
  • [4] L.-C. Chen, S. Fidler, A. L. Yuille, and R. Urtasun. Beat the MTurkers: Automatic image labeling from weak 3d supervision. In CVPR, 2014.
  • [5] L.-C. Chen, G. Papandreou, I. Kokkinos, K. Murphy, and A. L. Yuille. Semantic image segmentation with deep convolutional nets and fully connected CRFs. In ICLR, 2015.
  • [6] L.-C. Chen, A. Schwing, A. Yuille, and R. Urtasun. Learning deep structured models. In ICML, 2015.
  • [7] M.-M. Cheng, Z. Zhang, W.-Y. Lin, and P. H. S. Torr. BING: Binarized normed gradients for objectness estimation at 300fps. In CVPR, 2014.
  • [8] J. Dai, K. He, and J. Sun. Convolutional feature masking for joint object and stuff segmentation. arXiv:1412.1283, 2014.
  • [9] J. Dai, K. He, and J. Sun. BoxSup: Exploiting bounding boxes to supervise convolutional networks for semantic segmentation. arXiv:1503.01640, 2015.
  • [10] A. Delong, A. Osokin, H. N. Isack, and Y. Boykov. Fast approximate energy minimization with label costs. IJCV, 2012.
  • [11] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei. ImageNet: A large-scale hierarchical image database. In CVPR, 2009.
  • [12] P. Duygulu, K. Barnard, J. F. de Freitas, and D. A. Forsyth. Object recognition as machine translation: Learning a lexicon for a fixed image vocabulary. In ECCV, 2002.
  • [13] M. Everingham, S. M. A. Eslami, L. V. Gool, C. K. I. Williams, J. Winn, and A. Zisserman. The PASCAL visual object classes challenge: A retrospective. IJCV, 2014.
  • [14] C. Farabet, C. Couprie, L. Najman, and Y. LeCun. Learning hierarchical features for scene labeling. PAMI, 2013.
  • [15] M. Guillaumin, D. Küttel, and V. Ferrari. ImageNet auto-annotation with segmentation propagation. IJCV, 110(3):328–348, 2014.
  • [16] B. Hariharan, P. Arbelaez, L. Bourdev, S. Maji, and J. Malik. Semantic contours from inverse detectors. In ICCV, 2011.
  • [17] B. Hariharan, P. Arbelaez, R. Girshick, and J. Malik. Hypercolumns for object segmentation and fine-grained localization. arXiv:1411.5752, 2014.
  • [18] Y. Jia et al. Caffe: Convolutional architecture for fast feature embedding. arXiv:1408.5093, 2014.
  • [19] P. Krähenbühl and V. Koltun. Efficient inference in fully connected CRFs with Gaussian edge potentials. In NIPS, 2011.
  • [20] H. Kück and N. de Freitas. Learning about individuals from group statistics. In UAI, 2005.
  • [21] M. P. Kumar, H. Turki, D. Preston, and D. Koller. Learning specific-class segmentation from diverse data. In ICCV, 2011.
  • [22] V. Lempitsky, P. Kohli, C. Rother, and T. Sharp. Image segmentation with a bounding box prior. In ICCV, 2009.
  • [23] Y. Li and R. Zemel. High order regularization for semi-supervised learning of structured output problems. In ICML, 2014.
  • [24] T.-Y. Lin et al. Microsoft COCO: Common objects in context. In ECCV, 2014.
  • [25] J. Long, E. Shelhamer, and T. Darrell. Fully convolutional networks for semantic segmentation. arXiv:1411.4038, 2014.
  • [26] W.-L. Lu, J.-A. Ting, J. J. Little, and K. P. Murphy. Learning to track and identify players from broadcast sports videos. PAMI, 2013.
  • [27] M. Mostajabi, P. Yadollahpour, and G. Shakhnarovich. Feedforward semantic segmentation with zoom-out features. arXiv:1412.0774, 2014.
  • [28] M. Oquab, L. Bottou, I. Laptev, and J. Sivic. Weakly supervised object recognition with convolutional neural networks. In NIPS, 2014.
  • [29] G. Papandreou, I. Kokkinos, and P.-A. Savalle. Untangling local and global deformations in deep convolutional networks for image classification and sliding window detection. arXiv:1412.0296, 2014.
  • [30] D. Pathak, E. Shelhamer, J. Long, and T. Darrell. Fully convolutional multi-class multiple instance learning. arXiv:1412.7144, 2014.
  • [31] P. Pinheiro and R. Collobert. From image-level to pixel-level labeling with convolutional networks. In CVPR, 2015.
  • [32] P. Pletscher and P. Kohli. Learning low-order models for enforcing high-order statistics. In AISTATS, 2012.
  • [33] C. Rother, V. Kolmogorov, and A. Blake. GrabCut: Interactive foreground extraction using iterated graph cuts. In SIGGRAPH, 2004.
  • [34] K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition. arXiv:1409.1556, 2014.
  • [35] D. Tarlow, K. Swersky, R. S. Zemel, R. P. Adams, and B. J. Frey. Fast exact inference for recursive cardinality models. In UAI, 2012.
  • [36] J. Verbeek and B. Triggs. Region classification with Markov field aspect models. In CVPR, 2007.
  • [37] A. Vezhnevets, V. Ferrari, and J. M. Buhmann. Weakly supervised structured output learning for semantic segmentation. In CVPR, 2012.
  • [38] W. Xia, C. Domokos, J. Dong, L.-F. Cheong, and S. Yan. Semantic segmentation without annotating segments. In ICCV, 2013.
  • [39] J. Xu, A. G. Schwing, and R. Urtasun. Tell me what you see and I will show you where it is. In CVPR, 2014.
  • [40] J. Xu, A. G. Schwing, and R. Urtasun. Learning to segment under various forms of weak supervision. In CVPR, 2015.
  • [41] S. Zheng, S. Jayasumana, B. Romera-Paredes, V. Vineet, Z. Su, D. Du, C. Huang, and P. Torr. Conditional random fields as recurrent neural networks. arXiv:1502.03240, 2015.
  • [42] J. Zhu, J. Mao, and A. L. Yuille. Learning from weakly supervised data by the expectation loss SVM (e-SVM) algorithm. In NIPS, 2014.