Learning a Discriminative Feature Network for Semantic Segmentation

    CVPR, 2018.

    Keywords: border network, semantic label, semantic segmentation, stochastic gradient descent, neural network

    Abstract:

    Most existing methods of semantic segmentation still suffer from two challenges: intra-class inconsistency and inter-class indistinction. To tackle these two problems, we propose a Discriminative Feature Network (DFN), which contains two sub-networks: Smooth Network and Border Network. Specifically, to handle the intra-class inconsistency…

    Introduction
    • Semantic segmentation is a fundamental technique for numerous computer vision applications like scene understanding, human parsing and autonomous driving.
    • The features learned by existing FCN-based methods are usually not discriminative enough to differentiate: 1) patches that share the same semantic label but have different appearances, named intra-class inconsistency, as shown in the first row of Figure 1; and 2) adjacent patches that have different semantic labels but similar appearances, named inter-class indistinction, as shown in the second row of Figure 1
    • To address these two challenges, the authors rethink the semantic segmentation task from a more macroscopic point of view.
    • The authors present a novel Discriminative Feature Network (DFN) to learn the feature representation which considers both the “intra-class consistency” and the “inter-class distinction”
    Highlights
    • Semantic segmentation is a fundamental technique for numerous computer vision applications like scene understanding, human parsing and autonomous driving
    • We propose a Channel Attention Block (CAB), which utilizes the high-level features to guide the selection of lowlevel features stage-by-stage
    • We regard semantic segmentation as a task to assign a consistent semantic label to one category of objects, not just to each pixel, and propose a Discriminative Feature Network to simultaneously address the “intra-class consistency” and “inter-class variation” issues
    • Experiments on the PASCAL VOC 2012 and Cityscapes datasets validate the effectiveness of the proposed algorithm. We present a Smooth Network to enhance the intra-class consistency with the global context and the Channel Attention Block, and we design a bottom-up Border Network with deep supervision to enlarge the variation of features on both sides of the semantic boundary
    • With Smooth Network and Border Network, we propose our Discriminative Feature Network for semantic segmentation as illustrated in Figure 2 (a)
    • We redefine semantic segmentation from a macroscopic point of view, regarding it as a task to assign a consistent semantic label to one category of objects, rather than to each single pixel. This task requires both intra-class consistency and inter-class distinction. To consider both sides, we propose a Discriminative Feature Network, which contains two sub-networks: Smooth Network and Border Network
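The Channel Attention Block mentioned in the highlights can be sketched in pure Python. Everything below is an illustrative simplification, not the authors' implementation: tensors become nested lists, and the learned 1×1 convolution on the pooled vector is replaced by a caller-supplied weight matrix.

```python
import math

def channel_attention(features, attention_weights):
    """Hypothetical sketch of a Channel Attention Block.

    features: list of C channels, each a 2-D list (H x W) of floats.
    attention_weights: a C x C matrix standing in for the learned
    1x1 convolution on the globally pooled feature vector.
    Returns the channel-wise reweighted feature maps.
    """
    c = len(features)
    # Global average pooling: one scalar per channel.
    pooled = []
    for ch in features:
        total = sum(sum(row) for row in ch)
        count = len(ch) * len(ch[0])
        pooled.append(total / count)
    # "1x1 conv" on the pooled vector, then sigmoid -> per-channel weight.
    weights = []
    for i in range(c):
        z = sum(attention_weights[i][j] * pooled[j] for j in range(c))
        weights.append(1.0 / (1.0 + math.exp(-z)))
    # Reweight every channel by its attention weight.
    return [[[w * v for v in row] for row in ch]
            for ch, w in zip(features, weights)]
```

The key idea this illustrates is that high-level, globally pooled statistics decide how strongly each feature channel is passed on, rather than all channels being treated equally.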
    Methods
    • The authors first introduce in detail the proposed Discriminative Feature Network, which contains the Smooth Network and the Border Network.
    • The authors extend the base network to FCN4 structure [27, 36] with the proposed Refinement Residual Block (RRB), which improves the performance from 72.86% to 76.65%, as Table 2 shows.
    • The authors integrate the Border Network into the Smooth Network
    • This improves the performance from 79.54% to 79.67%, as shown in Table 3.
    • Figure 6 shows the predicted semantic boundary of Border Network.
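Integrating the Border Network into the Smooth Network implies a joint training objective. The following is a minimal sketch under stated assumptions: a softmax cross-entropy loss for the segmentation branch, a focal loss for the boundary branch (the paper cites the focal loss of [22]), and a hypothetical balance weight `lam`; the exact formulation and weight value are illustrative.

```python
import math

def cross_entropy(probs, labels):
    """Mean per-pixel cross-entropy; probs is a list of per-pixel
    class-probability lists, labels a list of class indices."""
    return -sum(math.log(p[y]) for p, y in zip(probs, labels)) / len(labels)

def focal_loss(probs, labels, gamma=2.0):
    """Focal loss down-weights easy pixels, which suits the heavily
    imbalanced boundary/non-boundary labels of the border branch."""
    return -sum((1.0 - p[y]) ** gamma * math.log(p[y])
                for p, y in zip(probs, labels)) / len(labels)

def dfn_loss(seg_probs, seg_labels, border_probs, border_labels, lam=0.1):
    """Joint objective: segmentation loss plus lam-weighted border loss.
    lam here is an assumed illustrative value."""
    return (cross_entropy(seg_probs, seg_labels)
            + lam * focal_loss(border_probs, border_labels))
```

The deep supervision the highlights mention would simply add further such loss terms at intermediate stages of each branch.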
    Results
    • The authors evaluate the approach on two public datasets: PASCAL VOC 2012 [9] and Cityscapes [8].
    • The authors evaluate each component of the proposed method, and analyze the results in detail.
    • PASCAL VOC 2012: The PASCAL VOC 2012 dataset is a well-known semantic segmentation benchmark that contains 20 object classes plus background, with 1,464 images for training, 1,449 for validation and 1,456 for testing.
    • The original dataset is augmented by the Semantic Boundaries Dataset [12], resulting in 10,582 images for training
    Conclusion
    • The authors redefine semantic segmentation from a macroscopic point of view, regarding it as a task to assign a consistent semantic label to one category of objects, rather than to each single pixel.
    • This task requires both intra-class consistency and inter-class distinction.
    • The authors' experimental results show that the proposed approach can significantly improve the performance on the PASCAL VOC 2012 and Cityscapes benchmarks
    Tables
    • Table1: The performance of ResNet-101 with and without random scale augmentation
    • Table2: Detailed performance comparison of our proposed
    • Table3: Combining the Border Network and Smooth Network as Discriminative Feature Network. SN: Smooth Network. BN: Border Network. MS Flip: Adding multi-scale inputs and left-right flipped inputs
    • Table4: Validation strategy on PASCAL VOC 2012 dataset. MS Flip: Multi-scale and flip evaluation
    • Table5: Performance on PASCAL VOC 2012 test set. Methods pre-trained on MS-COCO are marked with +
    • Table6: Performance on Cityscapes test set. The “-” indicates that the method does not report this result in its paper
    Related work
    • Recently, many approaches based on FCN have achieved high performance on different benchmarks [42, 9, 8]. Most of them are still constrained by the intra-class inconsistency and inter-class indistinction issues.

      Encoder-Decoder: The FCN model inherently encodes different levels of features. Naturally, some methods integrate them to refine the final prediction. This branch of methods mainly considers how to recover the spatial information lost to consecutive pooling operators or strided convolutions. For example, SegNet [1] uses the saved pooling indices to recover the reduced spatial information, U-Net [31] uses skip connections, and the Global Convolutional Network [30] adopts large kernels. Besides, LRR [11] adds a Laplacian pyramid reconstruction network, while RefineNet [19] uses a multi-path refinement network. However, this type of architecture ignores the global context. In addition, most methods of this type simply sum the features of adjacent stages without considering their diverse representations, which leads to inconsistent results.
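The summation-based fusion criticized above can be made concrete with a small sketch. Nearest-neighbor upsampling stands in for whatever upsampling a real network would use, and feature maps are plain nested lists; this is a sketch of the generic pattern, not any specific paper's code.

```python
def upsample2x(feature):
    """Nearest-neighbor 2x upsampling of a 2-D feature map (list of lists)."""
    out = []
    for row in feature:
        wide = [v for v in row for _ in (0, 1)]  # duplicate each column
        out.append(wide)
        out.append(list(wide))                    # duplicate each row
    return out

def sum_fusion(low_res_feature, high_res_feature):
    """Plain element-wise sum of adjacent-stage features after upsampling --
    the fusion the text criticizes for ignoring diverse representations."""
    up = upsample2x(low_res_feature)
    return [[a + b for a, b in zip(r1, r2)]
            for r1, r2 in zip(up, high_res_feature)]
```

The Channel Attention Block described earlier replaces this unweighted sum with a fusion in which high-level features first decide how much each low-level channel should contribute.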
    Funding
    • This work has been supported by the Project of the National Natural Science Foundation of China No.61433007 and No.61401170
    References
    • [1] V. Badrinarayanan, A. Kendall, and R. Cipolla. SegNet: A deep convolutional encoder-decoder architecture for image segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 39(12):2481–2495, 2017.
    • [2] J. Canny. A computational approach to edge detection. IEEE Transactions on Pattern Analysis and Machine Intelligence, 8(6):679–698, 1986.
    • [3] L. Chen, H. Zhang, J. Xiao, L. Nie, J. Shao, and T.-S. Chua. SCA-CNN: Spatial and channel-wise attention in convolutional networks for image captioning. In IEEE Conference on Computer Vision and Pattern Recognition, 2017.
    • [4] L.-C. Chen, G. Papandreou, I. Kokkinos, K. Murphy, and A. L. Yuille. Semantic image segmentation with deep convolutional nets and fully connected CRFs. In International Conference on Learning Representations, 2015.
    • [5] L.-C. Chen, G. Papandreou, I. Kokkinos, K. Murphy, and A. L. Yuille. DeepLab: Semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected CRFs. arXiv, 2016.
    • [6] L.-C. Chen, G. Papandreou, F. Schroff, and H. Adam. Rethinking atrous convolution for semantic image segmentation. arXiv, 2017.
    • [7] L.-C. Chen, Y. Yang, J. Wang, W. Xu, and A. L. Yuille. Attention to scale: Scale-aware semantic image segmentation. In IEEE Conference on Computer Vision and Pattern Recognition, 2016.
    • [8] M. Cordts, M. Omran, S. Ramos, T. Rehfeld, M. Enzweiler, R. Benenson, U. Franke, S. Roth, and B. Schiele. The Cityscapes dataset for semantic urban scene understanding. In IEEE Conference on Computer Vision and Pattern Recognition, 2016.
    • [9] M. Everingham, L. Van Gool, C. K. I. Williams, J. Winn, and A. Zisserman. The PASCAL Visual Object Classes Challenge 2012 (VOC2012) Results. http://www.pascalnetwork.org/challenges/VOC/voc2012/workshop/index.html
    • [10] F. Shen, R. Gan, S. Yan, and G. Zeng. Semantic segmentation via structured patch prediction, context CRF and guidance CRF. In IEEE Conference on Computer Vision and Pattern Recognition, 2017.
    • [11] G. Ghiasi and C. C. Fowlkes. Laplacian pyramid reconstruction and refinement for semantic segmentation. In European Conference on Computer Vision, 2016.
    • [12] B. Hariharan, P. Arbelaez, L. Bourdev, S. Maji, and J. Malik. Semantic contours from inverse detectors. In IEEE International Conference on Computer Vision, 2011.
    • [13] K. He, X. Zhang, S. Ren, and J. Sun. Spatial pyramid pooling in deep convolutional networks for visual recognition. In European Conference on Computer Vision, 2014.
    • [14] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In IEEE Conference on Computer Vision and Pattern Recognition, 2016.
    • [15] K. He, X. Zhang, S. Ren, and J. Sun. Identity mappings in deep residual networks. In European Conference on Computer Vision, 2016.
    • [16] J. Hu, L. Shen, and G. Sun. Squeeze-and-excitation networks. arXiv, 2017.
    • [17] A. Krizhevsky, I. Sutskever, and G. E. Hinton. ImageNet classification with deep convolutional neural networks. In Neural Information Processing Systems, 2012.
    • [18] X. Li, Z. Liu, P. Luo, C. C. Loy, and X. Tang. Not all pixels are equal: Difficulty-aware semantic segmentation via deep layer cascade. In IEEE Conference on Computer Vision and Pattern Recognition, 2017.
    • [19] G. Lin, A. Milan, C. Shen, and I. Reid. RefineNet: Multi-path refinement networks with identity mappings for high-resolution semantic segmentation. In IEEE Conference on Computer Vision and Pattern Recognition, 2017.
    • [20] G. Lin, C. Shen, A. van den Hengel, and I. Reid. Efficient piecewise training of deep structured models for semantic segmentation. In IEEE Conference on Computer Vision and Pattern Recognition, 2016.
    • [21] M. Lin, Q. Chen, and S. Yan. Network in network. In International Conference on Learning Representations, 2014.
    • [22] T.-Y. Lin, P. Goyal, R. Girshick, K. He, and P. Dollar. Focal loss for dense object detection. In IEEE International Conference on Computer Vision, 2017.
    • [23] T.-Y. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollar, and C. L. Zitnick. Microsoft COCO: Common objects in context. In European Conference on Computer Vision, 2014.
    • [24] W. Liu, A. Rabinovich, and A. C. Berg. ParseNet: Looking wider to see better. In International Conference on Learning Representations, 2016.
    • [25] Y. Liu, M.-M. Cheng, X. Hu, K. Wang, and X. Bai. Richer convolutional features for edge detection. In IEEE Conference on Computer Vision and Pattern Recognition, 2017.
    • [26] Z. Liu, X. Li, P. Luo, C.-C. Loy, and X. Tang. Semantic image segmentation via deep parsing network. In IEEE International Conference on Computer Vision, 2015.
    • [27] J. Long, E. Shelhamer, and T. Darrell. Fully convolutional networks for semantic segmentation. In IEEE Conference on Computer Vision and Pattern Recognition, 2015.
    • [28] V. Mnih, N. Heess, A. Graves, et al. Recurrent models of visual attention. In Neural Information Processing Systems, 2014.
    • [29] M. Mostajabi, P. Yadollahpour, and G. Shakhnarovich. Feedforward semantic segmentation with zoom-out features. In IEEE Conference on Computer Vision and Pattern Recognition, 2015.
    • [30] C. Peng, X. Zhang, G. Yu, G. Luo, and J. Sun. Large kernel matters – improve semantic segmentation by global convolutional network. In IEEE Conference on Computer Vision and Pattern Recognition, 2017.
    • [31] O. Ronneberger, P. Fischer, and T. Brox. U-Net: Convolutional networks for biomedical image segmentation. In International Conference on Medical Image Computing and Computer-Assisted Intervention, 2015.
    • [32] O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein, A. C. Berg, and L. Fei-Fei. ImageNet large scale visual recognition challenge. International Journal of Computer Vision, 115(3):211–252, 2015.
    • [33] F. Wang, M. Jiang, C. Qian, S. Yang, C. Li, H. Zhang, X. Wang, and X. Tang. Residual attention network for image classification. In IEEE Conference on Computer Vision and Pattern Recognition, 2017.
    • [34] P. Wang, P. Chen, Y. Yuan, D. Liu, Z. Huang, X. Hou, and G. Cottrell. Understanding convolution for semantic segmentation. In IEEE Conference on Computer Vision and Pattern Recognition, 2017.
    • [35] Z. Wu, C. Shen, and A. van den Hengel. Wider or deeper: Revisiting the ResNet model for visual recognition. arXiv, 2016.
    • [36] S. Xie and Z. Tu. Holistically-nested edge detection. In IEEE International Conference on Computer Vision, 2015.
    • [37] J. Yang, B. Price, S. Cohen, H. Lee, and M.-H. Yang. Object contour detection with a fully convolutional encoder-decoder network. In IEEE Conference on Computer Vision and Pattern Recognition, 2016.
    • [38] F. Yu and V. Koltun. Multi-scale context aggregation by dilated convolutions. In International Conference on Learning Representations, 2016.
    • [39] Z. Yu, C. Feng, M.-Y. Liu, and S. Ramalingam. CASENet: Deep category-aware semantic edge detection. In IEEE Conference on Computer Vision and Pattern Recognition, 2017.
    • [40] H. Zhao, J. Shi, X. Qi, X. Wang, and J. Jia. Pyramid scene parsing network. In IEEE Conference on Computer Vision and Pattern Recognition, 2017.
    • [41] S. Zheng, S. Jayasumana, B. Romera-Paredes, V. Vineet, Z. Su, D. Du, C. Huang, and P. H. Torr. Conditional random fields as recurrent neural networks. In IEEE International Conference on Computer Vision, 2015.
    • [42] B. Zhou, H. Zhao, X. Puig, S. Fidler, A. Barriuso, and A. Torralba. Scene parsing through ADE20K dataset. In IEEE Conference on Computer Vision and Pattern Recognition, 2017.