Context Prior for Scene Segmentation

    CVPR 2020.

    Keywords:
    stochastic gradient descent, contextual dependency, contextual information, semantic segmentation, pixel accuracy
    To embed the Context Prior into the network, we present a Context Prior Network, composed of a backbone network and a Context Prior Layer

    Abstract:

    Recent works have widely explored the contextual dependencies to achieve more accurate segmentation results. However, most approaches rarely distinguish different types of contextual dependencies, which may pollute the scene understanding. In this work, we directly supervise the feature aggregation to distinguish the intra-class and int…

    Introduction
    • Scene segmentation is a long-standing and challenging problem in computer vision with many downstream applications e.g., augmented reality, autonomous driving [8, 12], human-machine interaction, and video content analysis.
    • Several methods [49, 1, 3, 5] adopt pyramid-based modules or global pooling to regularly aggregate regional or global contextual information.
    • They capture the homogeneous contextual relationship, ignoring the contextual dependencies of different categories, as shown in Figure 1(b).
    • Due to the lack of explicit regularization, the relationships described by the attention mechanism are less clear.
    • It may select undesirable contextual dependencies, as visualized in Figure 1(e).
    • Both paths aggregate contextual information without explicit distinction, causing a mixture of different contextual relationships
    Highlights
    • Scene segmentation is a long-standing and challenging problem in computer vision with many downstream applications e.g., augmented reality, autonomous driving [8, 12], human-machine interaction, and video content analysis
    • We model the contextual relationships among categories as prior knowledge to obtain more accurate predictions, which is of great importance to scene segmentation
    • Based on the Context Prior, we propose a Context Prior Network, incorporating a Context Prior Layer with the supervision of an Affinity Loss, as shown in Figure 2
    • Our Context Prior Network achieves 81.3% mean intersection over union (mIoU) on the Cityscapes test set with only the fine dataset, outperforming DenseASPP based on DenseNet-161 [17] by 0.9 points
    • To embed the Context Prior into the network, we present a Context Prior Network, composed of a backbone network and a Context Prior Layer
    • Our algorithm achieves 46.3% mIoU on ADE20K, 53.9% mIoU on PASCAL-Context, and 81.3% mIoU on Cityscapes
    • Extensive quantitative and qualitative comparison shows that the proposed Context Prior Network performs favorably against recent state-of-the-art scene segmentation approaches
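The Context Prior Layer named in these highlights separates intra-class from inter-class context via a learned affinity ("prior") map. The sketch below is a hedged numpy illustration of the aggregation step only: the learned 1x1 convolution that produces the prior map is stood in for by a fixed random projection, so this is an assumption-laden sketch, not the authors' PyTorch implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

H, W, C = 4, 4, 8            # tiny spatial grid and channel count (illustrative)
N = H * W

X = rng.standard_normal((N, C))           # backbone features, flattened to N x C

# A learned 1x1 conv + sigmoid would produce the prior map; we stand in
# for it with a random projection (assumption: illustration only).
W_prior = rng.standard_normal((C, N))
P = 1.0 / (1.0 + np.exp(-(X @ W_prior)))  # context prior map, N x N, values in (0, 1)

# Intra-class context: aggregate features of pixels the prior marks as same-class.
intra = (P / (P.sum(axis=1, keepdims=True) + 1e-6)) @ X

# Inter-class context: aggregate the complementary (different-class) pixels.
Q = 1.0 - P
inter = (Q / (Q.sum(axis=1, keepdims=True) + 1e-6)) @ X

# The layer combines original features with both contexts for prediction.
out = np.concatenate([X, intra, inter], axis=1)   # N x 3C
```

The point of the sketch is the explicit split: `P` selects same-class pixels, `1 - P` the rest, so the two contexts are never mixed.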
    Results
    • The authors' algorithm achieves 46.3% mIoU on ADE20K, 53.9% mIoU on PASCAL-Context, and 81.3% mIoU on Cityscapes.
    • The authors' single model achieves 46.3% on the ADE20K validation set, 53.9% on the PASCAL-Context validation set, and 81.3% on the Cityscapes test set.
    • The authors' algorithm achieves 53.9% mIoU on the PASCAL-Context validation set, outperforming the state-of-the-art EncNet by over 1.0 point.
    • The authors' CPNet achieves 81.3% mIoU on the Cityscapes test set with only the fine dataset, outperforming DenseASPP based on DenseNet-161 [17] by 0.9 points
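mIoU, the metric quoted throughout these results, is the per-class intersection over union averaged across classes, usually computed from a confusion matrix. A minimal sketch with hypothetical labels (not the paper's data):

```python
import numpy as np

def mean_iou(conf):
    """conf[i, j] = number of pixels of true class i predicted as class j."""
    inter = np.diag(conf).astype(float)                     # true positives per class
    union = conf.sum(axis=0) + conf.sum(axis=1) - inter     # TP + FP + FN
    return np.mean(inter / np.maximum(union, 1))

# Toy prediction over six pixels and three classes (hypothetical data).
pred = np.array([0, 0, 1, 1, 2, 2])
true = np.array([0, 1, 1, 1, 2, 0])
K = 3
conf = np.zeros((K, K), dtype=int)
np.add.at(conf, (true, pred), 1)   # unbuffered accumulation into the matrix

print(mean_iou(conf))   # per-class IoUs are 1/3, 2/3, 1/2 -> mean 0.5
```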
    Conclusion
    • The authors construct an effective Context Prior for scene segmentation.
    • It distinguishes the different contextual dependencies with the supervision of the proposed Affinity Loss.
    • The Aggregation Module is applied to aggregate spatial information for reasoning about the contextual relationship and is embedded into the Context Prior Layer.
    • Extensive quantitative and qualitative comparison shows that the proposed CPNet performs favorably against recent state-of-the-art scene segmentation approaches
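The Affinity Loss mentioned in the conclusion supervises the prior map against an Ideal Affinity Map built from the ground truth: two pixels are affine exactly when they share a class. The sketch below is a hedged numpy illustration of the unary binary cross-entropy term only (the paper combines further terms; `ideal_affinity` and `unary_affinity_bce` are hypothetical helper names, not the authors' API):

```python
import numpy as np

def ideal_affinity(labels, num_classes):
    """labels: flattened ground-truth classes, shape (N,).
    Returns an N x N binary map: A[i, j] = 1 iff pixels i and j share a class."""
    one_hot = np.eye(num_classes)[labels]      # N x K one-hot encoding
    return one_hot @ one_hot.T                 # N x N

def unary_affinity_bce(prior, target, eps=1e-7):
    """Binary cross-entropy between predicted prior map and ideal affinity."""
    p = np.clip(prior, eps, 1 - eps)
    return -np.mean(target * np.log(p) + (1 - target) * np.log(1 - p))

labels = np.array([0, 0, 1, 2])                # toy 2x2 label map, flattened
A = ideal_affinity(labels, num_classes=3)
# Pixels 0 and 1 share class 0 -> affine; pixels 0 and 2 do not.
assert A[0, 1] == 1 and A[0, 2] == 0

prior = np.full_like(A, 0.5, dtype=float)      # uninformative predicted prior
loss = unary_affinity_bce(prior, A)
print(loss)   # BCE at p = 0.5 is ln 2, about 0.6931
```

Because `A` is constructed directly from the labels, the supervision explicitly tells the network which dependencies are intra-class and which are inter-class.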
    Tables
    • Table1: Ablative studies on the ADE20K [52] validation set in comparison to other contextual information aggregation approaches. (Notation: Aux auxiliary loss, BCE binary cross entropy loss, AL Affinity Loss, MS multi-scale and flip testing strategy.)
    • Table2: Experimental results (mIoU) w/ or w/o Context Prior based on different kernel sizes. (Notation: k the kernel size of the fully separable convolution, ∆ the improvement of introducing the Context Prior, CP Context Prior.)
    • Table3: Generalization to the PPM and ASPP module. The evaluation metric is mIoU (%). (Notation: PPM pyramid pooling module, ASPP atrous spatial pyramid pooling, CP Context Prior, AM Aggregation Module.)
    • Table4: Quantitative evaluations on the ADE20K validation set. The proposed CPNet performs favorably against state-of-the-art segmentation algorithms
    • Table5: Quantitative evaluations on the PASCAL-Context validation set. The proposed CPNet performs favorably against stateof-the-art segmentation methods. † means the method uses extra dataset
    • Table6: Quantitative evaluations on the Cityscapes test set. The proposed CPNet performs favorably against state-of-the-art segmentation methods. We only list the methods training with merely the fine dataset
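Table 2 varies the kernel size k of the fully separable convolution, the building block of the Aggregation Module named in the conclusion: a k x k convolution factorized both spatially (k x 1 then 1 x k) and over channels (depth-wise then point-wise). A small parameter-count sketch of the saving, under one plausible factorization (bias terms omitted; the paper's exact layer may differ):

```python
def standard_params(k, c):
    # plain k x k convolution with c input and c output channels
    return k * k * c * c

def fully_separable_params(k, c):
    # depth-wise k x 1, depth-wise 1 x k, then a 1 x 1 point-wise channel mix
    return k * c + k * c + c * c

# Illustrative sizes (not necessarily the paper's): large kernel, 256 channels.
k, c = 11, 256
print(standard_params(k, c))         # 7929856
print(fully_separable_params(k, c))  # 71168
```

The factorization is what makes large spatial kernels affordable for aggregating context.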
    Related work
    • Context Aggregation. In recent years, various methods have explored contextual information, which is crucial to scene understanding [1, 5, 32, 49, 43, 45, 44, 19, 26, 41]. There are mainly two paths to capture contextual dependencies. 1) PSPNet [49] adopts the pyramid pooling module to partition the feature map into different scale regions. It averages the pixels of each area as the local context of each pixel in this region. Meanwhile, Deeplab [1, 3, 5] methods employ atrous spatial pyramid pooling to sample different ranges of pixels as the local context. 2) DANet [11], OCNet [44], and CCNet [18] take advantage of the self-similarity manner [37] to aggregate long-range spatial information. Besides, EncNet [45], DFN [43], and ParseNet [27] use global pooling to harvest the global context.
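The pyramid pooling idea from PSPNet [49] described above can be sketched: average-pool the feature map to several grid sizes, so each pooled cell summarizes the region of pixels it covers. A minimal single-channel numpy illustration (the bilinear upsampling and final concatenation are elided; assumes H and W divisible by each bin size):

```python
import numpy as np

def pyramid_pool(feat, bins=(1, 2, 4)):
    """feat: H x W feature map (one channel for brevity).
    Returns one average-pooled grid per pyramid scale."""
    H, W = feat.shape
    pooled = []
    for b in bins:
        grid = np.zeros((b, b))
        hs, ws = H // b, W // b
        for i in range(b):
            for j in range(b):
                grid[i, j] = feat[i*hs:(i+1)*hs, j*ws:(j+1)*ws].mean()
        pooled.append(grid)
    return pooled

feat = np.arange(16, dtype=float).reshape(4, 4)
scales = pyramid_pool(feat)
print([g.shape for g in scales])   # [(1, 1), (2, 2), (4, 4)]
print(scales[0][0, 0])             # 7.5, the global average
```

Note the contrast with the Context Prior: pooling aggregates every pixel in a region uniformly, regardless of class, which is exactly the homogeneous-context limitation the paper targets.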
    Funding
    • This work is supported by the National Natural Science Foundation of China (Nos. 61433007 and 61876210).
    Reference
    • Liang-Chieh Chen, George Papandreou, Iasonas Kokkinos, Kevin Murphy, and Alan L Yuille. Semantic image segmentation with deep convolutional nets and fully connected crfs. Proc. International Conference on Learning Representations (ICLR), 2015. 1, 2, 5, 6, 8
    • Liang-Chieh Chen, George Papandreou, Iasonas Kokkinos, Kevin Murphy, and Alan L Yuille. Deeplab: Semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected crfs. arXiv, 2016. 5
    • Liang-Chieh Chen, George Papandreou, Florian Schroff, and Hartwig Adam. Rethinking atrous convolution for semantic image segmentation. arXiv, 2017. 1, 2, 3, 5
    • Liang-Chieh Chen, Yi Yang, Jiang Wang, Wei Xu, and Alan L Yuille. Attention to scale: Scale-aware semantic image segmentation. In Proc. IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016. 2
    • Liang-Chieh Chen, Yukun Zhu, George Papandreou, Florian Schroff, and Hartwig Adam. Encoder-decoder with atrous separable convolution for semantic image segmentation. In Proc. European Conference on Computer Vision (ECCV), pages 801–818, 2018. 1, 2, 5
    • Yunpeng Chen, Yannis Kalantidis, Jianshu Li, Shuicheng Yan, and Jiashi Feng. A 2-nets: Double attention networks. In Proc. Advances in Neural Information Processing Systems (NeurIPS), pages 352–361, 2018. 2
    • Francois Chollet. Xception: Deep learning with depthwise separable convolutions. In Proc. IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017. 5
    • Marius Cordts, Mohamed Omran, Sebastian Ramos, Timo Rehfeld, Markus Enzweiler, Rodrigo Benenson, Uwe Franke, Stefan Roth, and Bernt Schiele. The cityscapes dataset for semantic urban scene understanding. In Proc. IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016. 1, 5, 8
    • Jifeng Dai, Kaiming He, and Jian Sun. Boxsup: Exploiting bounding boxes to supervise convolutional networks for semantic segmentation. In Proc. IEEE International Conference on Computer Vision (ICCV), pages 1635–1643, 2015. 8
    • Henghui Ding, Xudong Jiang, Bing Shuai, Ai Qun Liu, and Gang Wang. Context contrasted feature and gated multiscale aggregation for scene segmentation. In Proc. IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 2393–2402, 2018. 8
    • Jun Fu, Jing Liu, Haijie Tian, Zhiwei Fang, and Hanqing Lu. Dual attention network for scene segmentation. Proc. IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2019. 2, 6, 8
    • Andreas Geiger, Philip Lenz, and Raquel Urtasun. Are we ready for autonomous driving? The KITTI vision benchmark suite. In Proc. IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 3354–3361, 2012.
    • Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proc. IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016. 5, 6
    • Andrew G Howard, Menglong Zhu, Bo Chen, Dmitry Kalenichenko, Weijun Wang, Tobias Weyand, Marco Andreetto, and Hartwig Adam. Mobilenets: Efficient convolutional neural networks for mobile vision applications. arXiv, 2017. 5
    • Han Hu, Jiayuan Gu, Zheng Zhang, Jifeng Dai, and Yichen Wei. Relation networks for object detection. In Proc. IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2018. 2
    • Jie Hu, Li Shen, and Gang Sun. Squeeze-and-excitation networks. Proc. IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2018. 2
    • Gao Huang, Zhuang Liu, Laurens van der Maaten, and Kilian Q. Weinberger. Densely connected convolutional networks. Proc. IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 2261–2269, 2017.
    • Zilong Huang, Xinggang Wang, Chang Huang, Yunchao Wei, and Wenyu Liu. Ccnet: Criss-cross attention for semantic segmentation. Proc. IEEE International Conference on Computer Vision (ICCV), 2019. 2
    • Wei-Chih Hung, Yi-Hsuan Tsai, Xiaohui Shen, Zhe Lin, Kalyan Sunkavalli, Xin Lu, and Ming-Hsuan Yang. Scene parsing with global context embedding. In Proc. IEEE International Conference on Computer Vision (ICCV), 2017. 1, 2
    • Sergey Ioffe and Christian Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. In Proc. International Conference on Machine Learning (ICML), pages 448–456, 2015. 4
    • Tsung-Wei Ke, Jyh-Jing Hwang, Ziwei Liu, and Stella X Yu. Adaptive affinity fields for semantic segmentation. In Proc. European Conference on Computer Vision (ECCV), pages 587–602, 2018. 8
    • Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. Imagenet classification with deep convolutional neural networks. In Proc. Advances in Neural Information Processing Systems (NeurIPS), 2012. 5
    • Hanchao Li, Pengfei Xiong, Jie An, and Lingxue Wang. Pyramid attention network for semantic segmentation. Proc. the British Machine Vision Conference (BMVC), 2018. 2
    • Xiaodan Liang, Hongfei Zhou, and Eric P. Xing. Dynamicstructured semantic propagation network. Proc. IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 752–761, 2018. 8
    • Guosheng Lin, Anton Milan, Chunhua Shen, and Ian Reid. Refinenet: Multi-path refinement networks with identity mappings for high-resolution semantic segmentation. Proc. IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017. 8
    • Huanyu Liu, Chao Peng, Changqian Yu, Jingbo Wang, Xu Liu, Gang Yu, and Wei Jiang. An end-to-end network for panoptic segmentation. Proc. IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 6165–6174, 2019. 2
    • Wei Liu, Andrew Rabinovich, and Alexander C Berg. Parsenet: Looking wider to see better. arXiv, 2016. 2, 3
    • Jonathan Long, Evan Shelhamer, and Trevor Darrell. Fully convolutional networks for semantic segmentation. In Proc. IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2015. 6, 8
    • Ningning Ma, Xiangyu Zhang, Hai-Tao Zheng, and Jian Sun. Shufflenet v2: Practical guidelines for efficient cnn architecture design. In Proc. European Conference on Computer Vision (ECCV), pages 116–131, 2018. 2
    • Roozbeh Mottaghi, Xianjie Chen, Xiaobai Liu, Nam-Gyu Cho, Seong-Whan Lee, Sanja Fidler, Raquel Urtasun, and Alan Yuille. The role of context for object detection and semantic segmentation in the wild. In Proc. IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2014. 5, 8
    • Adam Paszke, Sam Gross, Soumith Chintala, Gregory Chanan, Edward Yang, Zachary DeVito, Zeming Lin, Alban Desmaison, Luca Antiga, and Adam Lerer. Automatic differentiation in pytorch. In Neural Information Processing Systems Workshops, 2017. 5
    • Chao Peng, Xiangyu Zhang, Gang Yu, Guiming Luo, and Jian Sun. Large kernel matters–improve semantic segmentation by global convolutional network. Proc. IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017. 1, 2, 3, 5, 8
    • Christian Szegedy, Vincent Vanhoucke, Sergey Ioffe, Jonathon Shlens, and Zbigniew Wojna. Rethinking the inception architecture for computer vision. In Proc. IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 2818–2826, 2016. 5
    • Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, and Illia Polosukhin. Attention is all you need. In Proc. Advances in Neural Information Processing Systems (NeurIPS), 2017. 2
    • Jingdong Wang, Ke Sun, Tianheng Cheng, Borui Jiang, Chaorui Deng, Yang Zhao, Dong Liu, Yadong Mu, Mingkui Tan, Xinggang Wang, Wenyu Liu, and Bin Xiao. Deep high-resolution representation learning for visual recognition. arXiv, 2019. 1, 5
    • Panqu Wang, Pengfei Chen, Ye Yuan, Ding Liu, Zehua Huang, Xiaodi Hou, and Garrison Cottrell. Understanding convolution for semantic segmentation. Proc. IEEE Winter Conference on Applications of Computer Vision (WACV), 2018. 2, 8
    • Xiaolong Wang, Ross Girshick, Abhinav Gupta, and Kaiming He. Non-local neural networks. Proc. IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2018. 2, 6
    • Zifeng Wu, Chunhua Shen, and Anton van den Hengel. High-performance semantic segmentation using very deep fully convolutional networks. arXiv, 2016. 5
    • Tete Xiao, Yingcheng Liu, Bolei Zhou, Yuning Jiang, and Jian Sun. Unified perceptual parsing for scene understanding. In Proc. European Conference on Computer Vision (ECCV), pages 418–434, 2018. 8
    • Maoke Yang, Kun Yu, Chi Zhang, Zhiwei Li, and Kuiyuan Yang. Denseaspp for semantic segmentation in street scenes. In Proc. IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 3684–3692, 2018. 8
    • Wei Yin, Yifan Liu, Chunhua Shen, and Youliang Yan. Enforcing geometric constraints of virtual normal for depth prediction. Proc. IEEE International Conference on Computer Vision (ICCV), pages 5683–5692, 2019. 2
    • Changqian Yu, Jingbo Wang, Chao Peng, Changxin Gao, Gang Yu, and Nong Sang. Bisenet: Bilateral segmentation network for real-time semantic segmentation. In Proc. European Conference on Computer Vision (ECCV), pages 325– 341, 2018. 2, 5, 8
    • Changqian Yu, Jingbo Wang, Chao Peng, Changxin Gao, Gang Yu, and Nong Sang. Learning a discriminative feature network for semantic segmentation. In Proc. IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2018. 1, 2, 3, 5, 8
    • Yuhui Yuan and Jingdong Wang. Ocnet: Object context network for scene parsing. arXiv, 2018. 2, 5, 6
    • Hang Zhang, Kristin Dana, Jianping Shi, Zhongyue Zhang, Xiaogang Wang, Ambrish Tyagi, and Amit Agrawal. Context encoding for semantic segmentation. In Proc. IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 7151–7160, 2018. 1, 2, 3, 5, 6, 8
    • Hang Zhang, Han Zhang, Chenguang Wang, and Junyuan Xie. Co-occurrent features in semantic segmentation. In Proc. IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 548–557, 2019. 8
    • Rui Zhang, Sheng Tang, Yongdong Zhang, Jintao Li, and Shuicheng Yan. Scale-adaptive convolutions for scene parsing. In Proc. IEEE International Conference on Computer Vision (ICCV), pages 2031–2039, 2017. 7, 8
    • Xiangyu Zhang, Xinyu Zhou, Mengxiao Lin, and Jian Sun. Shufflenet: An extremely efficient convolutional neural network for mobile devices. In Proc. IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 6848– 6856, 2018. 2, 5
    • Hengshuang Zhao, Jianping Shi, Xiaojuan Qi, Xiaogang Wang, and Jiaya Jia. Pyramid scene parsing network. Proc. IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017. 1, 2, 3, 5, 6, 7, 8
    • Hengshuang Zhao, Yi Zhang, Shu Liu, Jianping Shi, Chen Change Loy, Dahua Lin, and Jiaya Jia. PSANet: Pointwise spatial attention network for scene parsing. In Proc. European Conference on Computer Vision (ECCV), 2018. 1, 2, 5, 6, 7, 8
    • Shuai Zheng, Sadeep Jayasumana, Bernardino RomeraParedes, Vibhav Vineet, Zhizhong Su, Dalong Du, Chang Huang, and Philip HS Torr. Conditional random fields as recurrent neural networks. In Proc. IEEE International Conference on Computer Vision (ICCV), 2015. 8
    • Bolei Zhou, Hang Zhao, Xavier Puig, Sanja Fidler, Adela Barriuso, and Antonio Torralba. Scene parsing through ade20k dataset. In Proc. IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017. 5, 6
    • Zhen Zhu, Mengde Xu, Song Bai, Tengteng Huang, and Xiang Bai. Asymmetric non-local neural networks for semantic segmentation. In Proc. IEEE International Conference on Computer Vision (ICCV), pages 593–602, 2019. 8