Feature Pyramid Transformer

Keywords:
visual recognition, Transformer, different scale, non-local, Mixture of Softmaxes
Weibo:
Our Feature Pyramid Transformer does not change the size of the feature pyramid, and is generic and easy to plug-and-play with modern deep networks

Abstract:

Feature interactions across space and scales underpin modern visual recognition systems because they introduce beneficial visual contexts. Conventionally, spatial contexts are passively hidden in the CNN's increasing receptive fields or actively encoded by non-local convolution. Yet, the non-local spatial interactions are not across scales.
Introduction
  • Modern visual recognition systems stand in context. Thanks to the hierarchical structure of the Convolutional Neural Network (CNN), as illustrated in Fig. 1 (a), contexts are encoded in the gradually larger receptive fields (the green dashed rectangles) by pooling [1,2], stride [3] or dilated convolution [4].
  • Objects of different scales are recognized in their corresponding levels, e.g., mouse in lower levels and table in higher levels.
  • CNN offers an in-network feature pyramid [8], i.e., lower/higher-level feature maps represent higher-/lower-resolution visual content without computational overhead [9,10].
  • As shown in Fig. 1 (b), the authors can recognize objects of different scales by using feature maps of different levels, i.e., small objects are recognized in lower levels and large objects in higher levels [11,12,13]; a minimal sketch of such a bottom-up feature pyramid follows this list.
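To make the in-network feature pyramid concrete, the following is a minimal PyTorch sketch (our illustration, not the authors' code) that collects the multi-resolution feature maps C2–C5 from a standard torchvision ResNet-50; the backbone choice and stage names are assumptions made only for this example.

```python
# Minimal sketch: collecting a bottom-up feature pyramid (BFP) from a CNN backbone.
# Lower levels keep higher resolution; higher levels carry coarser, more semantic features.
import torch
import torch.nn as nn
from torchvision.models import resnet50

class BottomUpPyramid(nn.Module):
    """Returns the C2..C5 feature maps of a ResNet-50 (strides 4, 8, 16 and 32)."""
    def __init__(self):
        super().__init__()
        r = resnet50(weights=None)
        self.stem = nn.Sequential(r.conv1, r.bn1, r.relu, r.maxpool)
        self.stages = nn.ModuleList([r.layer1, r.layer2, r.layer3, r.layer4])

    def forward(self, x):
        x = self.stem(x)
        feats = []
        for stage in self.stages:
            x = stage(x)          # each stage halves the resolution (except layer1)
            feats.append(x)
        return feats              # [C2, C3, C4, C5], from fine to coarse

pyramid = BottomUpPyramid()(torch.randn(1, 3, 224, 224))
print([tuple(f.shape) for f in pyramid])   # 56x56, 28x28, 14x14 and 7x7 feature maps
```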
Highlights
  • Modern visual recognition systems stand in context
  • We propose a novel feature pyramid network called Feature Pyramid Transformer (FPT) for visual recognition tasks such as instance-level and pixel-level segmentation
  • This paper focuses on two instance-level tasks (object detection and instance segmentation) and one pixel-level task (semantic segmentation)
  • We proposed an efficient feature interaction approach called FPT, composed of three carefully-designed transformers that respectively encode the explicit self-level, top-down and bottom-up information in the feature pyramid
  • Our FPT does not change the size of the feature pyramid, and is generic and easy to plug-and-play with modern deep networks (a simplified sketch of this fusion scheme follows this list)
  • Our extensive quantitative and qualitative results on three challenging visual recognition tasks showed that FPT achieves consistent improvements over the baselines and the state-of-the-art methods, validating its high effectiveness and strong application capability
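To illustrate how such a plug-and-play fusion can keep the pyramid size unchanged, here is a hedged sketch (a simplification we wrote for illustration, not the released FPT code): for each level, a self-level branch plus re-scaled top-down and bottom-up branches are concatenated and projected back to the original channel width, so the output pyramid matches the input pyramid exactly. A 3×3 convolution stands in for ST, and simple rescaling and averaging stand in for GT and RT.

```python
# Hedged sketch of an FPT-style fusion: each output level keeps its input resolution and
# channel width, but aggregates a self-level branch plus re-scaled top-down and bottom-up
# branches. These are stand-ins for the attention-based ST/GT/RT, not the real transformers.
import torch
import torch.nn as nn
import torch.nn.functional as F

class FPTStyleFusion(nn.Module):
    def __init__(self, channels, num_levels):
        super().__init__()
        self.self_branch = nn.ModuleList(
            [nn.Conv2d(channels, channels, 3, padding=1) for _ in range(num_levels)])
        # Concatenate [self-level, top-down, bottom-up] and project back to `channels`,
        # so the fused pyramid has exactly the same shape as the input pyramid.
        self.project = nn.ModuleList(
            [nn.Conv2d(3 * channels, channels, 1) for _ in range(num_levels)])

    def forward(self, pyramid):                        # pyramid[0] is the finest level
        out = []
        for i, feat in enumerate(pyramid):
            size = feat.shape[-2:]
            self_feat = self.self_branch[i](feat)      # stand-in for the Self-Transformer (ST)
            # Stand-in for the Grounding Transformer (GT): coarser levels rescaled to level i.
            higher = [F.interpolate(p, size=size, mode="nearest") for p in pyramid[i + 1:]]
            top_down = sum(higher) / len(higher) if higher else torch.zeros_like(feat)
            # Stand-in for the Rendering Transformer (RT): finer levels rescaled to level i.
            lower = [F.interpolate(p, size=size, mode="nearest") for p in pyramid[:i]]
            bottom_up = sum(lower) / len(lower) if lower else torch.zeros_like(feat)
            fused = torch.cat([self_feat, top_down, bottom_up], dim=1)
            out.append(self.project[i](fused))
        return out                                     # same sizes as the input pyramid

levels = [torch.randn(1, 256, s, s) for s in (64, 32, 16, 8)]
print([tuple(f.shape) for f in FPTStyleFusion(256, 4)(levels)])
```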
Methods
  • Baseline comparisons on MS-COCO 2017 test-dev with a ResNet-101 backbone (see Table 3); each value pair is bounding box / mask:

    BFP+FPN [12]: 36.2/35.7, 59.1/58.0, 39.0/37.8, 18.2/15.5, 39.0/38.1, 52.4/49.2
    BFP+BPA [13]: 37.3/36.3, 60.4/59.0, 39.9/38.3, 18.9/16.3, 39.7/39.0, 53.0/50.5
    BFP+BFI [31]: 39.5 (bounding box AP); remaining entries not reported

    BFP+FPT is evaluated with the Augmented Head (AH), with Multi-scale Training (MT), and with both (“all”), against BFP+FPN [12], BFP+BPA [13] and BFP+BFI [31] under the same “all” setting.
    The semantic segmentation experiments use four datasets: (1) Cityscapes [22] has 19 classes and includes 2,975, 500 and 1,525 finely annotated images for training, validation and test, respectively; (2) ADE20K [24] has 150 classes and uses 20k, 2k and 3k images for training, validation and test, respectively; (3) LIP [25] contains 50,462 images with 20 classes and includes 30,462, 10k and 10k images for training, validation and test, respectively; (4) PASCAL VOC 2012 [23] contains 21 classes and includes 1,464, 1,449 and 1,456 images for training, validation and test, respectively.

  • The authors cropped the input images to 969 × 969 for Cityscapes, 573 × 573 for LIP, and 521 × 521 for PASCAL VOC 2012.
  • The authors' FPT was applied to the feature pyramids constructed by three methods: UFP [29], PPM [1,15] and ASPP [14].
  • With ResNet-101, PPM [15]+FPT reaches 80.4 (↑ 0.5), 44.8 (↑ 1.1) and 54.2 (↑ 1.2) mIoU, and the comparisons include ASPP [14]+OC [32] with the same backbone (see Table 5); a minimal PPM-style sketch follows this list.
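For reference, below is a minimal sketch of a PPM-style pyramid construction in the spirit of [1,15]; the bin sizes (1, 2, 3, 6), the reduced channel width and the final concatenation are commonly used settings assumed for illustration, not necessarily the authors' exact configuration.

```python
# Minimal PPM-style pyramid sketch (after [1,15]): the last backbone feature map is pooled
# into several bin sizes, reduced with 1x1 convs, upsampled and concatenated with the input,
# yielding same-resolution pyramid levels that differ in information granularity.
import torch
import torch.nn as nn
import torch.nn.functional as F

class PyramidPooling(nn.Module):
    def __init__(self, in_channels, bins=(1, 2, 3, 6)):
        super().__init__()
        mid = in_channels // len(bins)
        self.branches = nn.ModuleList([
            nn.Sequential(nn.AdaptiveAvgPool2d(b),
                          nn.Conv2d(in_channels, mid, 1),
                          nn.ReLU(inplace=True))
            for b in bins
        ])

    def forward(self, x):
        size = x.shape[-2:]
        feats = [x] + [F.interpolate(branch(x), size=size, mode="bilinear", align_corners=False)
                       for branch in self.branches]
        return torch.cat(feats, dim=1)   # levels share resolution, differ in context

c5 = torch.randn(1, 2048, 32, 32)        # e.g. the last ResNet feature map
print(PyramidPooling(2048)(c5).shape)    # torch.Size([1, 4096, 32, 32])
```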
Results
  • The full combination of ST, GT and RT yields the best performance, i.e., 38.0% bounding box AP (6.4% higher than BFP) on object detection and 36.8% mask AP (6.9% higher than BFP) on instance segmentation.
  • The authors' best model achieves 1.4% and 2.6% improvements in training mIoU (Tr.mIoU) and validation mIoU (Val.mIoU), respectively.
  • From Table 5, the authors observe that FPT achieves new state-of-the-art performance over all previous methods based on the same backbone (i.e., ResNet-101).
Conclusion
  • The authors proposed an efficient feature interaction approach called FPT, composed of three carefully-designed transformers that respectively encode the explicit self-level, top-down and bottom-up information in the feature pyramid.
  • The authors' FPT does not change the size of the feature pyramid, and is generic and easy to plug-and-play with modern deep networks.
  • The authors' extensive quantitative and qualitative results on three challenging visual recognition tasks showed that FPT achieves consistent improvements over the baselines and the state-of-the-art methods, validating its high effectiveness and strong application capability.
Summary
  • Objectives:

    The authors aim to conduct the non-local interaction at the corresponding scales of the interacting objects (or parts).
Tables
  • Table 1: Ablation study on the MS-COCO 2017 val set [21]. “BFP” is Bottom-up Feature Pyramid [12]; “ST” is Self-Transformer; “GT” is Grounding Transformer; “RT” is Rendering Transformer. Results on the left and right of the dashed line are for bounding box detection and instance segmentation, respectively
  • Table 2: Ablation study of SBN [41] and DropBlock [40] on the MS-COCO 2017 val set [21]. Results on the left and right of the dashed line are respectively for bounding box detection and instance segmentation
  • Table 3: Experimental results on MS-COCO 2017 test-dev [21]. “AH” is Augmented Head, and “MT” is Multi-scale Training [13]; “all” means that both AH and MT are used. Results on the left and right of the dashed line are for bounding box detection and instance segmentation, respectively. “-” means that there is no reported result in the corresponding paper
  • Table 4: Ablation study on the Cityscapes validation set [22]. “LGT” is Locality-constrained Grounding Transformer; “RT” is Rendering Transformer; “ST” is Self-Transformer. “+” means building the method on top of UFP
  • Table 5: Comparisons with the state of the art on the test sets of Cityscapes [22] and PASCAL VOC 2012 [23], and the validation sets of ADE20K [24] and LIP [25]. Results in this table refer to mIoU; “-” means that there is no reported result in the corresponding paper. The best and second-best models under each setting are marked with the corresponding formats
  • Table 6: Comparing F_eud with F_sim on the validation set of MS-COCO 2017 [21]. The backbone is ResNet-50 [42]. “BFP” is the bottom-up feature pyramid [12]. Results on the left and right of the dashed line are respectively from bounding box detection and instance segmentation
  • Table 7: The influence of N on ST. Experiments are carried out on the validation set of MS-COCO 2017 [21]. The backbone is ResNet-50 [42]. “BFP” is the bottom-up feature pyramid [12]. “w/o MoS” means that these results are obtained without MoS [34]. Results on the left and right of the dashed line are for bounding box detection and instance segmentation
  • Table 8: The influence of N on GT. Experiments are carried out on the validation set of MS-COCO 2017 [21]. The backbone is ResNet-50 [42]. “BFP” is the bottom-up feature pyramid [12]. “w/o MoS” means that these results are obtained without MoS [34]. Results on the left and right of the dashed line are for bounding box detection and instance segmentation
  • Table 9: The influence of the square size of LGT on the pixel-level semantic segmentation task. The backbone is the dilated ResNet-101 [4]. Experiments are carried out on the training and validation sets of Cityscapes [22]. “UFP” is the unscathed feature pyramid
  • Table 10: The influence of block size and keep prob of DropBlock [40] on the instance-level tasks (i.e., object detection and instance segmentation). The backbone is ResNet-50 [42]. Results on the left and right of the dashed line are the AP of bounding box detection and the mask AP of instance segmentation
  • Table 11: The influence of block size and keep prob of DropBlock [40] on pixel-level semantic segmentation. The backbone is the dilated ResNet-101 [4]. Experiments are carried out on the validation set of Cityscapes [22]. Results in this table refer to the mIoU on the validation set (i.e., Val.mIoU)
  • Table 12: Model complexity analysis on the validation set of MS-COCO 2017 [21] for instance segmentation. The backbone is ResNet-50 [42]. “BFP” is the bottom-up feature pyramid [12]
  • Table 13: Combining FPN/BPA/BFI and NL-ResNet/GC-ResNet/AA-ResNet on the validation set of MS-COCO 2017 [21]. The base is ResNet-50. Results on the left and right of the dashed line are respectively from bounding box detection and instance segmentation
  • Table 14: Result comparisons on different backbones. Experiments are carried out on the validation set of MS-COCO 2017 [21]. “BFP” is the bottom-up feature pyramid [12]. Results on the left and right of the dashed line are for bounding box detection and instance segmentation
Related work
  • FPT is generic and applies to a wide range of computer vision tasks. This paper focuses on two instance-level tasks (object detection and instance segmentation) and one pixel-level task (semantic segmentation). Object detection aims to predict a bounding box for each object and then assign the bounding box a class label [6], while instance segmentation additionally requires predicting a pixel-level mask of the object [26]. Semantic segmentation aims to assign a class label to each pixel of the image [27].
  • Feature pyramid. The in-network feature pyramid (i.e., the Bottom-up Feature Pyramid (BFP) [12]) is one of the most commonly used methods, and has been shown useful for boosting object detection [9], instance segmentation [13] and semantic segmentation [28]. Another popular way of constructing a feature pyramid uses feature maps of a single scale and processes them through pyramidal pooling or dilated/atrous convolutions. For example, atrous spatial pyramid pooling [14] and the pyramid pooling module [1,15] leverage the output feature maps of the last convolution layer in the CNN backbone to build a four-level feature pyramid, in which different levels have the same resolution but different information granularities. Our approach is based on the existing BFP (for the instance-level tasks) and the unscathed feature pyramid [29] (for the pixel-level task). Our contribution is the novel feature interaction approach.
  • Feature interaction. An intuitive approach to cross-scale feature interaction is gradually summing the multi-scale feature maps, as in the Feature Pyramid Network (FPN) [12] and the Path Aggregation Network (PANet) [13]. In particular, both FPN and PANet are based on BFP, where FPN adds a top-down path to propagate semantic information into low-level feature maps, and PANet adds a bottom-up path augmentation on top of FPN. Another approach is to concatenate multi-scale feature maps along the channel dimension; specific examples for semantic segmentation are DeepLab [30] and the pyramid scene parsing network [15]. Besides, a more recent work, ZigZagNet [31], exploits addition and convolution to enhance cross-scale feature interaction. For within-scale feature interaction, some recent works exploit the non-local operation [16] and self-attention [17] to capture co-occurrent object features in the same scene (a basic non-local block is sketched below); their models were evaluated in a wide range of visual tasks [11,32,19,33]. However, we argue that non-local interaction performed in just one uniform-scale feature map is not enough to represent the contexts. In this work, we aim to conduct the non-local interaction at the corresponding scales of the interacting objects (or parts).
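To make the within-scale interaction concrete, here is a hedged sketch of a basic non-local (self-attention) block in the spirit of [16,17]; the 1×1-convolution projections, softmax-normalized attention and residual connection follow the standard formulation, not the paper's exact Mixture-of-Softmaxes variant.

```python
# Basic non-local (self-attention) block in the spirit of [16,17]: every spatial position
# attends to every other position of the SAME feature map, i.e. a within-scale interaction.
import torch
import torch.nn as nn

class NonLocalBlock(nn.Module):
    def __init__(self, channels, reduction=2):
        super().__init__()
        inner = channels // reduction
        self.query = nn.Conv2d(channels, inner, 1)
        self.key = nn.Conv2d(channels, inner, 1)
        self.value = nn.Conv2d(channels, inner, 1)
        self.out = nn.Conv2d(inner, channels, 1)

    def forward(self, x):
        b, c, h, w = x.shape
        q = self.query(x).flatten(2).transpose(1, 2)   # (b, hw, inner)
        k = self.key(x).flatten(2)                     # (b, inner, hw)
        v = self.value(x).flatten(2).transpose(1, 2)   # (b, hw, inner)
        attn = torch.softmax(q @ k / (q.shape[-1] ** 0.5), dim=-1)   # (b, hw, hw)
        y = (attn @ v).transpose(1, 2).reshape(b, -1, h, w)
        return x + self.out(y)                         # residual: contexts added in place

x = torch.randn(1, 256, 32, 32)
print(NonLocalBlock(256)(x).shape)                     # same size: torch.Size([1, 256, 32, 32])
```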
Funding
  • This work was partially supported by the National Key Research and Development Program of China under Grant 2018AAA0102002, the National Natural Science Foundation of China under Grant 61925204, the China Scholarships Council under Grant 201806840058, the Singapore Ministry of Education (MOE) Academic Research Fund (AcRF) Tier 1 grant, and the NTU-Alibaba JRI
Study subjects and analysis
datasets: 4
It obtains impressive improvements of 1.6%, 1.2%, 1.7% and 1.8% mIoU on Cityscapes [22], ADE20K [24], LIP [25] and PASCAL VOC 2012 [23], respectively. Besides, compared to OCNet, FPT obtains gains of 0.9%, 1.1%, 1.3% and 0.8% mIoU on these four datasets on average. In Fig. 5, we provide the qualitative results of our method.


Reference
  • 1. He, K., Zhang, X., Ren, S., Sun, J.: Spatial pyramid pooling in deep convolutional networks for visual recognition. TPAMI 37(9) (2015) 1904–1916
  • 2. Lazebnik, S., Schmid, C., Ponce, J.: Beyond bags of features: Spatial pyramid matching for recognizing natural scene categories. In: CVPR. (2006)
  • 3. Springenberg, J.T., Dosovitskiy, A., Brox, T., Riedmiller, M.: Striving for simplicity: The all convolutional net. In: ICLR. (2015)
  • 4. Yu, F., Koltun, V.: Multi-scale context aggregation by dilated convolutions. In: ICLR. (2016)
  • 5. Girshick, R.: Fast R-CNN. In: ICCV. (2015)
  • 6. Ren, S., He, K., Girshick, R., Sun, J.: Faster R-CNN: Towards real-time object detection with region proposal networks. In: NeurIPS. (2015)
  • 7. Adelson, E.H., Anderson, C.H., Bergen, J.R., Burt, P.J., Ogden, J.M.: Pyramid methods in image processing. RCA Engineer 29(6) (1984) 33–41
  • 8. Zeiler, M.D., Fergus, R.: Visualizing and understanding convolutional networks. In: ECCV. (2014)
  • 9. Liu, W., Anguelov, D., Erhan, D., Szegedy, C., Reed, S., Fu, C.Y., Berg, A.C.: SSD: Single shot multibox detector. In: ECCV. (2016)
  • 10. Redmon, J., Divvala, S., Girshick, R., Farhadi, A.: You only look once: Unified, real-time object detection. In: CVPR. (2016)
  • 11. Hu, H., Gu, J., Zhang, Z., Dai, J., Wei, Y.: Relation networks for object detection. In: CVPR. (2018)
  • 12. Lin, T.Y., Dollar, P., Girshick, R., He, K., Hariharan, B., Belongie, S.: Feature pyramid networks for object detection. In: CVPR. (2017)
  • 13. Liu, S., Qi, L., Qin, H., Shi, J., Jia, J.: Path aggregation network for instance segmentation. In: CVPR. (2018)
  • 14. Chen, L.C., Papandreou, G., Schroff, F., Adam, H.: Rethinking atrous convolution for semantic image segmentation. In: arXiv. (2017)
  • 15. Zhao, H., Shi, J., Qi, X., Wang, X., Jia, J.: Pyramid scene parsing network. In: CVPR. (2017)
  • 16. Wang, X., Girshick, R., Gupta, A., He, K.: Non-local neural networks. In: CVPR. (2018)
  • 17. Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, L., Polosukhin, I.: Attention is all you need. In: NeurIPS. (2017)
  • 18. Carion, N., Massa, F., Synnaeve, G., Usunier, N., Kirillov, A., Zagoruyko, S.: End-to-end object detection with transformers. In: ECCV. (2020)
  • 19. Zhang, H., Zhang, H., Wang, C., Xie, J.: Co-occurrent features in semantic segmentation. In: CVPR. (2019)
  • 20. Jouppi, N.P., Young, C., Patil, N., Patterson, D., Agrawal, G., Bajwa, R., Bates, S., Bhatia, S., Boden, N., Borchers, A., et al.: In-datacenter performance analysis of a tensor processing unit. In: ISCA. (2017)
  • 21. Lin, T.Y., Maire, M., Belongie, S., Hays, J., Perona, P., Ramanan, D., Dollar, P., Zitnick, C.L.: Microsoft COCO: Common objects in context. In: ECCV. (2014)
  • 22. Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The Cityscapes dataset for semantic urban scene understanding. In: CVPR. (2016)
  • 23. Everingham, M., Eslami, S.A., Van Gool, L., Williams, C.K., Winn, J., Zisserman, A.: The Pascal visual object classes challenge: A retrospective. IJCV 111(1) (2015) 98–136
  • 24. Zhou, B., Zhao, H., Puig, X., Fidler, S., Barriuso, A., Torralba, A.: Scene parsing through ADE20K dataset. In: CVPR. (2017)
  • 25. Gong, K., Liang, X., Zhang, D., Shen, X., Lin, L.: Look into person: Self-supervised structure-sensitive learning and a new benchmark for human parsing. In: CVPR. (2017)
  • 26. He, K., Gkioxari, G., Dollar, P., Girshick, R.: Mask R-CNN. In: ICCV. (2017)
  • 27. Long, J., Shelhamer, E., Darrell, T.: Fully convolutional networks for semantic segmentation. In: CVPR. (2015)
  • 28. Zhang, Z., Zhang, X., Peng, C., Xue, X., Sun, J.: ExFuse: Enhancing feature fusion for semantic segmentation. In: ECCV. (2018)
  • 29. Peng, C., Zhang, X., Yu, G., Luo, G., Sun, J.: Large kernel matters – improve semantic segmentation by global convolutional network. In: CVPR. (2017)
  • 30. Chen, L.C., Papandreou, G., Kokkinos, I., Murphy, K., Yuille, A.L.: DeepLab: Semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected CRFs. TPAMI 40(4) (2017) 834–848
  • 31. Lin, D., Shen, D., Shen, S., Ji, Y., Lischinski, D., Cohen-Or, D., Huang, H.: ZigZagNet: Fusing top-down and bottom-up context for object segmentation. In: CVPR. (2019)
  • 32. Yuan, Y., Wang, J.: OCNet: Object context network for scene parsing. In: arXiv. (2018)
  • 33. Zhu, Z., Xu, M., Bai, S., Huang, T., Bai, X.: Asymmetric non-local neural networks for semantic segmentation. In: ICCV. (2019)
  • 34. Yang, Z., Dai, Z., Salakhutdinov, R., Cohen, W.W.: Breaking the softmax bottleneck: A high-rank RNN language model. In: ICLR. (2018)
  • 35. Zhang, Y., Hare, J., Prugel-Bennett, A.: Learning to count objects in natural images for visual question answering. In: ICLR. (2018)
  • 36. Chen, L., Zhang, H., Xiao, J., Nie, L., Shao, J., Liu, W., Chua, T.S.: SCA-CNN: Spatial and channel-wise attention in convolutional networks for image captioning. In: CVPR. (2017)
  • 37. Lin, M., Chen, Q., Yan, S.: Network in network. In: ICLR. (2014)
  • 38. Yu, C., Wang, J., Peng, C., Gao, C., Yu, G., Sang, N.: Learning a discriminative feature network for semantic segmentation. In: CVPR. (2018)
  • 39. Deng, J., Dong, W., Socher, R., Li, L.J., Li, K., Fei-Fei, L.: ImageNet: A large-scale hierarchical image database. In: CVPR. (2009)
  • 40. Ghiasi, G., Lin, T.Y., Le, Q.V.: DropBlock: A regularization method for convolutional networks. In: NeurIPS. (2018)
  • 41. Zhang, H., Dana, K., Shi, J., Zhang, Z., Wang, X., Tyagi, A., Agrawal, A.: Context encoding for semantic segmentation. In: CVPR. (2018)
  • 42. He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: CVPR. (2016)
  • 43. Cao, Y., Xu, J., Lin, S., Wei, F., Hu, H.: GCNet: Non-local networks meet squeeze-excitation networks and beyond. In: ICCV. (2019)
  • 44. Bello, I., Zoph, B., Vaswani, A., Shlens, J., Le, Q.V.: Attention augmented convolutional networks. In: ICCV. (2019)
  • 45. Zhou, Y., Zhu, Y., Ye, Q., Qiu, Q., Jiao, J.: Weakly supervised instance segmentation using class peak response. In: CVPR. (2018)
  • 46. Zhu, L., Wang, T., Aksu, E., Kamarainen, J.K.: Portrait instance segmentation for mobile devices. In: ICME. (2019)
  • 47. Sun, K., Zhao, Y., Jiang, B., Cheng, T., Xiao, B., Liu, D., Mu, Y., Wang, X., Liu, W., Wang, J.: High-resolution representations for labeling pixels and regions. In: arXiv. (2019)
  • 48. Takikawa, T., Acuna, D., Jampani, V., Fidler, S.: Gated-SCNN: Gated shape CNNs for semantic segmentation. In: ICCV. (2019)