Adaptive Affinity Field for Semantic Segmentation

ECCV, 2018.

Keywords:
convolutional neural nets, Generative Adversarial Networks, affinity loss, Adaptive Affinity Fields, mean Intersection over Union

Abstract:

Existing semantic segmentation methods mostly rely on per-pixel supervision, unable to capture structural regularity present in natural images. Instead of learning to enforce semantic labels on individual pixels, we propose to enforce affinity field patterns in individual pixel neighbourhoods, i.e., the semantic label patterns of whether …

Introduction
  • Semantic segmentation of an image refers to the challenging task of assigning each pixel a categorical label, e.g., motorcycle or person.
  • Even with big training data and with deeper and more complex network architectures, pixel-wise classification based approaches fundamentally lack spatial discrimination power when foreground pixels and background pixels are close or mixed together: segmentation is poor when the visual evidence for the foreground is weak, e.g., glass motorcycle shields, or when the spatial structure is small, e.g., the thin radial spokes of the wheels (Fig. 1c)
Highlights
  • Semantic segmentation of an image refers to the challenging task of assigning each pixel a categorical label, e.g., motorcycle or person
  • Instead of enforcing semantic labels on individual pixels and matching labels between neighbouring pixels using Conditional Random Fields (CRF) or Generative Adversarial Networks (GAN), we propose the concept of Adaptive Affinity Fields (AAF) to capture and match the relations between neighbouring pixels in the label space (a sketch of such a pairwise objective follows this list)
  • We briefly describe other popular methods that are used for comparison in our experiments, namely, GAN’s adversarial learning [12], contrastive loss [9], and CRF [15]
  • We demonstrate that our proposed AAF achieves 82.17% and 79.07% mean Intersection over Union (mIoU), better than PSPNet by 1.54% and 2.77% and competitive with the state-of-the-art performance
  • We measure the instance-wise mIoU on the VOC and Cityscapes validation sets, as summarized in Table 6 and Table 7, respectively. In instance-wise mIoU, our AAF is higher than the base architecture by 3.94% on VOC and 2.94% on Cityscapes
  • We propose adaptive affinity fields (AAF) for semantic segmentation, which incorporate geometric regularities into segmentation models, and learn local relations with adaptive ranges through adversarial training
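
As a hedged illustration of this idea (the notation below is introduced for exposition, not taken verbatim from the paper): for a pixel i with ground-truth label y_i and predicted label distribution ŷ_i, an affinity objective over its neighbourhood N(i) attracts same-label pairs and repels different-label pairs up to a margin m:

```latex
% Hedged sketch of a pairwise affinity objective. The divergence D,
% neighbourhood N(i), and margin m are illustrative assumptions.
\mathcal{L}_{\text{affinity}}(i) = \sum_{j \in \mathcal{N}(i)}
  \begin{cases}
    D\bigl(\hat{y}_i \,\|\, \hat{y}_j\bigr) & \text{if } y_i = y_j \ \text{(attract)}\\[2pt]
    \max\bigl(0,\; m - D(\hat{y}_i \,\|\, \hat{y}_j)\bigr) & \text{if } y_i \neq y_j \ \text{(repel)}
  \end{cases}
```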
Methods
  • Methods of Comparison

    The authors briefly describe other popular methods that are used for comparison in the experiments, namely, GAN’s adversarial learning [12], contrastive loss [9], and CRF [15].

    GAN’s Adversarial Learning.
  • The authors investigate the popular framework of Generative Adversarial Networks (GAN) [12].
  • The discriminator D in the GAN serves to inject priors on region structures.
  • The adversarial loss is formulated as in the first sketch after this list.
  • The contrastive loss is formulated as in the second sketch after this list.
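
The source page omits both formulas. As hedged stand-ins, the following are the generic forms from the cited works, under notation assumed here (segmenter prediction ŷ, ground-truth label map y, discriminator D), not necessarily the paper's exact variants. First, the standard GAN objective [12] adapted to segmentation:

```latex
% Generic GAN adversarial objective; D scores whole label maps,
% injecting a prior on region structure (notation assumed here).
\mathcal{L}_{\text{adv}} =
  \mathbb{E}_{y}\bigl[\log D(y)\bigr] +
  \mathbb{E}_{\hat{y}}\bigl[\log\bigl(1 - D(\hat{y})\bigr)\bigr]
```

Second, the standard contrastive loss [9], for a pixel pair (i, j) with embedding distance d_{ij} and margin m:

```latex
% Generic contrastive loss (Chopra et al. [9]): pull together
% same-label pairs, push apart different-label pairs up to margin m.
\mathcal{L}_{\text{contrast}}(i, j) =
  \mathbb{1}[y_i = y_j]\,\tfrac{1}{2}\, d_{ij}^{2} +
  \mathbb{1}[y_i \neq y_j]\,\tfrac{1}{2}\,\max\bigl(0,\; m - d_{ij}\bigr)^{2}
```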
Results
  • The authors benchmark the proposed methods on two datasets, PASCAL VOC 2012 [11] and Cityscapes [10].
  • On PASCAL VOC 2012, the training procedure for PSPNet and AAF is the same as follows: The authors first train the networks on train_aug and fine-tune on train_val.
  • The authors measure the instance-wise mIoU on the VOC and Cityscapes validation sets, as summarized in Table 6 and Table 7, respectively. In instance-wise mIoU, the AAF is higher than the base architecture by 3.94% on VOC and 2.94% on Cityscapes.
Conclusion
  • The authors' affinity loss encourages similar network predictions on two pixels of the same ground-truth label, regardless of what their actual labels are
  • The collection of such pairwise bonds inside a segment ensure that all the pixels achieve the same label.
  • The authors' affinity loss pushes network predictions apart on two pixels of different ground-truth labels, again regardless of what their actual labels are
  • The collection of such pairwise repulsion help create clear segmentation boundaries.
  • It provides a novel perspective towards structure modeling in deep learning
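
A minimal NumPy sketch of this attract/repel behaviour for a single 3x3 neighbourhood (kernel size k = 3). The function name, the KL-divergence affinity measure, and the default margin are assumptions for illustration, not the paper's exact formulation:

```python
import numpy as np

def affinity_field_loss(probs, labels, margin=3.0):
    """Hedged sketch: attract same-label neighbour pairs, repel
    different-label pairs up to a margin, averaged over all pairs.

    probs  : (H, W, C) softmax predictions
    labels : (H, W)    integer ground-truth labels
    """
    H, W, _ = probs.shape
    eps = 1e-8
    total, count = 0.0, 0
    # The eight neighbour offsets of a 3x3 window around each pixel.
    for dy, dx in [(-1, -1), (-1, 0), (-1, 1), (0, -1),
                   (0, 1), (1, -1), (1, 0), (1, 1)]:
        ys, ye = max(0, -dy), H - max(0, dy)   # valid centre rows
        xs, xe = max(0, -dx), W - max(0, dx)   # valid centre cols
        p = probs[ys:ye, xs:xe]                      # centre predictions
        q = probs[ys + dy:ye + dy, xs + dx:xe + dx]  # neighbour predictions
        same = (labels[ys:ye, xs:xe] ==
                labels[ys + dy:ye + dy, xs + dx:xe + dx])
        # KL(p || q) as the (dis)affinity between the two predictions.
        kl = np.sum(p * (np.log(p + eps) - np.log(q + eps)), axis=-1)
        # Attract same-label pairs; repel different-label pairs up to m.
        pair_loss = np.where(same, kl, np.maximum(0.0, margin - kl))
        total += pair_loss.sum()
        count += pair_loss.size
    return total / count
```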
Summary
  • Introduction:

    Semantic segmentation of an image refers to the challenging task of assigning each pixel a categorical label, e.g., motorcycle or person.
  • Even with big training data and with deeper and more complex network architectures, pixel-wise classification based approaches fundamentally lack spatial discrimination power when foreground pixels and background pixels are close or mixed together: segmentation is poor when the visual evidence for the foreground is weak, e.g., glass motorcycle shields, or when the spatial structure is small, e.g., the thin radial spokes of the wheels (Fig. 1c)
  • Objectives:

    The authors' size-adaptive affinity field loss is achieved with adversarial learning: Maximizing the affinity loss over different kernel sizes selects the most critical range for imposing pairwise relationships in the label space, and the goal is to minimize this maximal loss – i.e., use the best worst case scenario for most effective training.
  • Methods:

    Methods of Comparison

    The authors briefly describe other popular methods that are used for comparison in the experiments, namely, GAN’s adversarial learning [12], contrastive loss [9], and CRF [15].

    GAN’s Adversarial Learning.
  • The authors investigate the popular framework of Generative Adversarial Networks (GAN) [12].
  • The discriminator D in the GAN serves to inject priors on region structures.
  • The adversarial and contrastive losses are formulated as in the sketches given after the Methods list above.
  • Results:

    The authors benchmark the proposed methods on two datasets, PASCAL VOC 2012 [11] and Cityscapes [10].
  • On PASCAL VOC 2012, the training procedure for PSPNet and AAF is the same as follows: The authors first train the networks on train_aug and fine-tune on train_val.
  • The authors measure the instance-wise mIoU on the VOC and Cityscapes validation sets, as summarized in Table 6 and Table 7, respectively. In instance-wise mIoU, the AAF is higher than the base architecture by 3.94% on VOC and 2.94% on Cityscapes.
  • Conclusion:

    The authors' affinity loss encourages similar network predictions on two pixels of the same ground-truth label, regardless of what their actual labels are
  • The collection of such pairwise bonds inside a segment ensure that all the pixels achieve the same label.
  • The authors' affinity loss pushes network predictions apart on two pixels of different ground-truth labels, again regardless of what their actual labels are
  • The collection of such pairwise repulsion help create clear segmentation boundaries.
  • It provides a novel perspective towards structure modeling in deep learning
Tables
  • Table1: Key differences between our method and other popular structure modeling approaches, namely CRF [15] and GAN [12]. The performance (% mIoU) is reported with the PSPNet [36] architecture on the Cityscapes [10] validation set
  • Table2: Per-class results on Pascal VOC 2012 validation set. Gray colored background denotes using FCN as the base architecture
  • Table3: Per-class results on Cityscapes validation set. Gray colored background denotes using FCN as the base architecture
  • Table4: Per-class results on Pascal VOC 2012 testing set
  • Table5: Per-class results on Cityscapes test set
  • Table6: Per-class instance-wise IoU results on Pascal VOC 2012 validation set
  • Table7: Per-class instance-wise IoU results on Cityscapes validation set
  • Table8: Per-class boundary recall results on Pascal VOC 2012 validation set
  • Table9: Per-class boundary recall results on Cityscapes validation set
  • Table10: Per-category IoU results of AAF with different combinations of kernel sizes k on VOC 2012 validation set. '✓' denotes the inclusion of the respective kernel size, as opposed to '×'
  • Table11: Per-class results on GTA5 Part 1
Related work
  • Most methods treat semantic segmentation as a pixel-wise classification task, and those that model structural correlations provide a small gain at a large computational cost.

    Semantic Segmentation. Since the introduction of fully convolutional networks for semantic segmentation [21], deeper [33, 36, 16] and wider [25, 29, 34] network architectures have been explored, drastically improving the performance on benchmarks such as PASCAL VOC [11]. For example, Wu et al. [33] achieved higher segmentation accuracy by replacing backbone networks with the more powerful ResNet [14], whereas Yu et al. [34] tackled fine-detailed segmentation using atrous convolutions. While the performance gain in terms of mIoU is impressive, these pixel-wise classification based approaches fundamentally lack spatial discrimination power when foreground and background pixels are close or mixed together, resulting in the unnatural artifacts shown in Fig. 1c.

    Structure Modeling. Image segmentation has highly correlated outputs among the pixels. Formulating it as an independent pixel labeling problem not only makes the pixel-level classification unnecessarily hard, but also leads to artifacts and spatially incoherent results. Several ways to incorporate structure information into segmentation have been investigated [15, 8, 37, 19, 17, 4, 24]. For example, Chen et al. [6] utilized denseCRF [15] as post-processing to refine the final segmentation results. Zheng et al. [37] and Liu et al. [19] further made the CRF module differentiable within the deep neural network. Pairwise low-level image cues, such as grouping affinity [23, 18] and contour cues [3, 5], have also been used to encode structures. However, these methods are sensitive to visual appearance changes, or require expensive iterative inference procedures.
Funding
  • With FCN [21] as base architecture, the affinity field loss and AAF improve the performance by 2.16% and 3.04% on VOC and by 1.88% and 2.37% on Cityscapes
  • With PSPNet [36] as base architecture, GAN loss, embedding contrastive loss, affinity field loss and AAF improve the mean IoU by 0.62%, 1.24%, 1.68% and 2.27% on VOC; affinity field loss and AAF improve it by 2.00% and 2.52% on Cityscapes
  • We demonstrate that our proposed AAF achieves 82.17% and 79.07% mIoU, better than PSPNet by 1.54% and 2.77% and competitive with the state-of-the-art performance
  • We measure the instance-wise mIoU on the VOC and Cityscapes validation sets, as summarized in Table 6 and Table 7, respectively. In instance-wise mIoU, our AAF is higher than the base architecture by 3.94% on VOC and 2.94% on Cityscapes
  • The "bottle" category is improved by 12.89% on VOC; "pole" and "tlight" are improved by 9.51% and 9.04% on Cityscapes
  • The overall boundary recall is improved by 7.9% and 8.0% on VOC and Cityscapes, respectively
  • It is shown that without fine-tuning, our proposed AAF outperforms the PSPNet [36] baseline model by 9.5% in mean pixel accuracy and 1.46% in mIoU, which demonstrates the robustness of our proposed methods against appearance variations
Reference
  • [1] Amir, A., Lindenbaum, M.: Grouping-based nonadditive verification. TPAMI (1998)
  • [2] Arbelaez, P., Maire, M., Fowlkes, C., Malik, J.: Contour detection and hierarchical image segmentation. TPAMI (2011)
  • [3] Bertasius, G., Shi, J., Torresani, L.: Semantic segmentation with boundary neural fields. In: CVPR (2016)
  • [4] Bertasius, G., Torresani, L., Yu, S.X., Shi, J.: Convolutional random walk networks for semantic image segmentation. In: CVPR (2017)
  • [5] Chen, L.C., Barron, J.T., Papandreou, G., Murphy, K., Yuille, A.L.: Semantic image segmentation with task-specific edge detection using CNNs and a discriminatively trained domain transform. In: CVPR (2016)
  • [6] Chen, L.C., Papandreou, G., Kokkinos, I., Murphy, K., Yuille, A.L.: DeepLab: Semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected CRFs. arXiv preprint arXiv:1606.00915 (2016)
  • [7] Chen, L.C., Papandreou, G., Schroff, F., Adam, H.: Rethinking atrous convolution for semantic image segmentation. arXiv preprint arXiv:1706.05587 (2017)
  • [8] Chen, L.C., Schwing, A., Yuille, A., Urtasun, R.: Learning deep structured models. In: ICML (2015)
  • [9] Chopra, S., Hadsell, R., LeCun, Y.: Learning a similarity metric discriminatively, with application to face verification. In: CVPR (2005)
  • [10] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The Cityscapes dataset for semantic urban scene understanding. In: CVPR (2016)
  • [11] Everingham, M., Van Gool, L., Williams, C.K., Winn, J., Zisserman, A.: The PASCAL visual object classes (VOC) challenge. IJCV (2010)
  • [12] Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., Bengio, Y.: Generative adversarial nets. In: NIPS (2014)
  • [13] Hariharan, B., Arbelaez, P., Bourdev, L., Maji, S., Malik, J.: Semantic contours from inverse detectors. In: ICCV (2011)
  • [14] He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: CVPR (2016)
  • [15] Krahenbuhl, P., Koltun, V.: Efficient inference in fully connected CRFs with Gaussian edge potentials. In: NIPS (2011)
  • [16] Li, X., Liu, Z., Luo, P., Loy, C.C., Tang, X.: Not all pixels are equal: Difficulty-aware semantic segmentation via deep layer cascade. In: CVPR (2017)
  • [17] Lin, G., Shen, C., van den Hengel, A., Reid, I.: Efficient piecewise training of deep structured models for semantic segmentation. In: CVPR (2016)
  • [18] Liu, S., De Mello, S., Gu, J., Zhong, G., Yang, M.H., Kautz, J.: Learning affinity via spatial propagation networks. In: NIPS (2017)
  • [19] Liu, Z., Li, X., Luo, P., Loy, C.C., Tang, X.: Semantic image segmentation via deep parsing network. In: ICCV (2015)
  • [20] Liu, Z., Li, X., Luo, P., Loy, C.C., Tang, X.: Deep learning Markov random field for semantic segmentation. TPAMI (2017)
  • [21] Long, J., Shelhamer, E., Darrell, T.: Fully convolutional networks for semantic segmentation. In: CVPR (2015)
  • [22] Luc, P., Couprie, C., Chintala, S., Verbeek, J.: Semantic segmentation using adversarial networks. NIPS Workshop (2016)
  • [23] Maire, M., Narihira, T., Yu, S.X.: Affinity CNN: Learning pixel-centric pairwise relations for figure/ground embedding. In: CVPR (2016)
  • [24] Mostajabi, M., Maire, M., Shakhnarovich, G.: Regularizing deep networks by modeling and predicting label structure. In: CVPR (2018)
  • [25] Noh, H., Hong, S., Han, B.: Learning deconvolution network for semantic segmentation. In: ICCV (2015)
  • [26] Poggio, T.: Early vision: From computational structure to algorithms and parallel hardware. Computer Vision, Graphics, and Image Processing 31(2), 139–155 (1985)
  • [27] Radford, A., Metz, L., Chintala, S.: Unsupervised representation learning with deep convolutional generative adversarial networks. arXiv preprint arXiv:1511.06434 (2015)
  • [28] Richter, S.R., Vineet, V., Roth, S., Koltun, V.: Playing for data: Ground truth from computer games. In: ECCV (2016)
  • [29] Ronneberger, O., Fischer, P., Brox, T.: U-Net: Convolutional networks for biomedical image segmentation. In: MICCAI (2015)
  • [30] Russakovsky, O., Deng, J., Su, H., Krause, J., Satheesh, S., Ma, S., Huang, Z., Karpathy, A., Khosla, A., Bernstein, M., Berg, A.C., Fei-Fei, L.: ImageNet large scale visual recognition challenge. IJCV (2015)
  • [31] Shi, J., Malik, J.: Normalized cuts and image segmentation. TPAMI (2000)
  • [32] Simonyan, K., Zisserman, A.: Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556 (2014)
  • [33] Wu, Z., Shen, C., van den Hengel, A.: High-performance semantic segmentation using very deep fully convolutional networks. arXiv preprint arXiv:1604.04339 (2016)
  • [34] Yu, F., Koltun, V.: Multi-scale context aggregation by dilated convolutions. In: ICLR (2016)
  • [35] Yu, S.X., Shi, J.: Multiclass spectral clustering. In: ICCV (2003)
  • [36] Zhao, H., Shi, J., Qi, X., Wang, X., Jia, J.: Pyramid scene parsing network. In: CVPR (2017)
  • [37] Zheng, S., Jayasumana, S., Romera-Paredes, B., Vineet, V., Su, Z., Du, D., Huang, C., Torr, P.H.: Conditional random fields as recurrent neural networks. In: ICCV (2015)