Distilling Localization for Self-Supervised Representation Learning

Nanxuan Zhao
Rynson W. H. Lau
Keywords:
foreground object, proxy task, convolutional neural network, large scale, transfer learning

Abstract:

For high-level visual recognition, self-supervised learning defines and makes use of proxy tasks such as colorization and visual tracking to learn a semantic representation useful for distinguishing objects. In this paper, through visualizing and diagnosing classification errors, we observe that current self-supervised models are ineffective at localizing the foreground object, and the learned representation can be predominantly determined by background pixels.

Introduction
  • Visual recognition has been revolutionized by deep learning through assembling large amounts of labeled data [7] and training very deep neural networks [25].
  • The saliency method RBD improves performance by 2.8%, and the saliency network BASNet by 6.4%.
  • In Table 2, the authors find a correlation between performance on the saliency benchmark and the quality of the learned self-supervised representation.
  • Texture backgrounds improve performance only marginally.
  • This is possibly because the textured images in the dataset [3] fall outside the ImageNet distribution.
  • With only 20% to 50% of images receiving the copy-and-paste augmentation, the authors significantly improve performance, by 3%−6%.
Highlights
  • Visual recognition has been revolutionized by deep learning through assembling large amounts of labeled data [7] and training very deep neural networks [25]
  • We find that current self-supervised models lack the ability to localize foreground objects, and the learned representation can be predominantly determined by background pixels
  • Based on these findings, we propose to copy the foreground object estimated by the saliency methods in Section 4 and paste it onto various backgrounds, as a data-driven augmentation for learning localization (see the sketch after this list)
  • We evaluate the transfer learning ability of our model on object recognition, scene categorization, and object detection benchmarks, and compare with state-of-the-art methods
  • We identified a strong error pattern among self-supervised models in their failure to localize foreground objects
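A minimal sketch of the copy-and-paste augmentation, assuming the foreground matte has already been estimated by an off-the-shelf saliency method (the paper ablates RBD and BASNet, among others). The function `copy_paste`, its parameters, and the optional Gaussian blending are illustrative assumptions, not the authors' released code:

```python
# Minimal sketch of the copy-and-paste augmentation (illustrative, not the
# authors' code). Assumes a foreground matte estimated by an off-the-shelf
# saliency method such as RBD or BASNet.
import numpy as np

def copy_paste(image, saliency_mask, background, paste_prob=0.5, blend_sigma=None):
    """Composite the salient foreground of `image` onto `background`.

    image, background: float arrays of shape (H, W, 3) in [0, 1].
    saliency_mask:     float array of shape (H, W) in [0, 1].
    paste_prob:        fraction of images that receive the augmentation
                       (the paper reports gains with only 20%-50%).
    """
    if np.random.rand() > paste_prob:
        return image  # leave most images untouched

    alpha = saliency_mask[..., None]  # (H, W, 1) soft matte
    if blend_sigma is not None:
        # Optional Gaussian feathering of the matte edges, one of the
        # blending options ablated in Table 2(d).
        from scipy.ndimage import gaussian_filter
        alpha = gaussian_filter(alpha, sigma=(blend_sigma, blend_sigma, 0))

    # Alpha-blend: foreground pixels come from the original image,
    # everything else from the new background.
    return alpha * image + (1.0 - alpha) * background
```

Backgrounds can then be sampled from other images, solid colors, or textures; per the ablations above, applying the augmentation to only 20% to 50% of images already yields the reported 3%−6% gains.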
Methods
  • The authors employ two existing diagnostic methods with minor adjustments.

    – Nearest Neighbors: retrieving the most similar images in the learned feature space.
    – Class-Score Gradients: the magnitude of class-score gradients in the pixel space indicates how important each pixel is for classification.
  • The gradient-based approach has proven effective for weakly-supervised object localization [36].
  • Since self-supervised models do not have object classifiers of their own, the authors train a linear classifier on top of the extracted features to obtain class scores (a minimal sketch of both tools follows).
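A hypothetical PyTorch sketch of the two diagnostic tools; the helper names and signatures (`nearest_neighbors`, `gradient_saliency`, `encoder`, `linear_head`) are assumptions for illustration, not the paper's code:

```python
# Hypothetical sketch of the two diagnostic tools: nearest-neighbor
# retrieval in feature space, and pixel-space class-score gradients
# computed through a linear classifier trained on frozen features [36].
import torch
import torch.nn.functional as F

def nearest_neighbors(query_feat, bank_feats, k=5):
    """Return indices of the k most similar features (cosine similarity)."""
    q = F.normalize(query_feat, dim=-1)     # (D,)
    bank = F.normalize(bank_feats, dim=-1)  # (N, D)
    sims = bank @ q                         # (N,)
    return sims.topk(k).indices

def gradient_saliency(encoder, linear_head, image, target_class):
    """Magnitude of the class-score gradient w.r.t. input pixels.

    encoder:     frozen self-supervised backbone, image -> feature.
    linear_head: linear classifier trained on the frozen features
                 (self-supervised models have no object classifier of
                 their own, hence this probe).
    """
    image = image.clone().requires_grad_(True)  # (1, 3, H, W)
    score = linear_head(encoder(image))[0, target_class]
    score.backward()
    # Max absolute gradient over color channels, as in [36].
    return image.grad.abs().max(dim=1)[0]       # (1, H, W)
```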
Results
  • The authors evaluate the transfer learning ability of the model on object recognition, scene categorization, and object detection benchmarks, and compare with state-of-the-art methods.

Conclusion
  • The authors identified a strong error pattern among self-supervised models in their failure to localize foreground objects.
  • The authors propose a simple data-driven approach to distill localization by learning invariance against backgrounds (illustrated in the sketch at the end of this section).
  • The improvements achieved suggest that the localization problem is prevalent in self-supervised representation learning.
  • The authors' method may not be the ideal way to solve this localization problem.
  • The authors are interested in finding a clever “proxy task” that can help distill such localization abilities.
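As a concrete but speculative illustration of learning invariance against backgrounds, one plausible instantiation is an InfoNCE-style objective in which an image and its background-swapped copy form a positive pair; the paper's exact training recipe may differ, and all names below are hypothetical:

```python
# Illustrative sketch (not the authors' exact recipe): a background-swapped
# view serves as the positive pair in an InfoNCE-style instance-
# discrimination loss, pushing the representation to be invariant to
# background content.
import torch
import torch.nn.functional as F

def background_invariant_nce(feats_orig, feats_swapped, temperature=0.07):
    """InfoNCE loss between images and their background-swapped views.

    feats_orig, feats_swapped: (B, D) embeddings of the two views;
    row i of each tensor comes from the same underlying image.
    """
    z1 = F.normalize(feats_orig, dim=1)
    z2 = F.normalize(feats_swapped, dim=1)
    logits = z1 @ z2.t() / temperature  # (B, B) similarity matrix
    labels = torch.arange(z1.size(0), device=z1.device)
    # Diagonal entries are the positive (same-image) pairs; all other
    # images in the batch act as negatives.
    return F.cross_entropy(logits, labels)
```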
Objectives
  • The authors' goal is to learn a representation from which the foreground object can be automatically localized, so that the model can focus on discriminative regions to improve recognition.
Tables
  • Table 1: A comparison study of the role of data augmentations for learning self-supervised and supervised representations. Please refer to the main text for details
  • Table 2: Ablation studies investigating the copy-and-paste augmentation: (a) various saliency estimation methods; (b) the ratio of images receiving the augmentation; (c) various background images; (d) blending options
  • Table 3: Comparisons with baselines
  • Table 4: State-of-the-art comparisons
  • Table 5: Transfer learning for object detection on VOC 2007 using Faster R-CNN with R50-C4. The gap to ImageNet supervised pre-training is shown in brackets for reference. For MoCo [21], the officially released model is used for finetuning. All numbers are averages of three independent runs
  • Table 6: Scene recognition on Places. LA reports 10-crop accuracy on this dataset
Related work
  • Unsupervised and Self-Supervised Learning. Unsupervised learning aims to extract semantically meaningful representations without human labels [34]. Self-supervised learning is a sub-branch of unsupervised learning which automatically generates learning signals from the data itself. These learning signals have been derived from proxy tasks that involve semantic image understanding but do not require semantic labels for training. These tasks have been based on prediction of color [50], context [9, 31], rotation [16], and motion [30]. Autoencoders [40] and GANs [17, 11] have also shown promising results for representation learning through reconstructing images.
References
  1. Arandjelovic, R., Zisserman, A.: Object discovery with a copy-pasting GAN. arXiv preprint arXiv:1905.11369 (2019)
  2. Bachman, P., Hjelm, R.D., Buchwalter, W.: Learning representations by maximizing mutual information across views. arXiv preprint arXiv:1906.00910 (2019)
  3. Bylinskii, Z., Judd, T., Borji, A., Itti, L., Durand, F., Oliva, A., Torralba, A.: MIT saliency benchmark
  4. Caron, M., Bojanowski, P., Joulin, A., Douze, M.: Deep clustering for unsupervised learning of visual features. In: Proceedings of the European Conference on Computer Vision (ECCV) (2018)
  5. Cheng, M.M., Mitra, N.J., Huang, X., Torr, P.H., Hu, S.M.: Global contrast based salient region detection. IEEE Transactions on Pattern Analysis and Machine Intelligence (2014)
  6. Cubuk, E.D., Zoph, B., Mane, D., Vasudevan, V., Le, Q.V.: AutoAugment: Learning augmentation strategies from data. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. pp. 113–123 (2019)
  7. Deng, J., Dong, W., Socher, R., Li, L.J., Li, K., Fei-Fei, L.: ImageNet: A large-scale hierarchical image database. In: 2009 IEEE Conference on Computer Vision and Pattern Recognition. IEEE (2009)
  8. DeVries, T., Taylor, G.W.: Improved regularization of convolutional neural networks with Cutout. arXiv preprint arXiv:1708.04552 (2017)
  9. Doersch, C., Gupta, A., Efros, A.A.: Unsupervised visual representation learning by context prediction. In: Proceedings of the IEEE International Conference on Computer Vision (2015)
  10. Doersch, C., Zisserman, A.: Multi-task self-supervised visual learning. In: Proceedings of the IEEE International Conference on Computer Vision. pp. 2051–2060 (2017)
  11. Donahue, J., Simonyan, K.: Large scale adversarial representation learning. arXiv preprint arXiv:1907.02544 (2019)
  12. Dosovitskiy, A., Fischer, P., Ilg, E., Hausser, P., Hazirbas, C., Golkov, V., Van Der Smagt, P., Cremers, D., Brox, T.: FlowNet: Learning optical flow with convolutional networks. In: Proceedings of the IEEE International Conference on Computer Vision (2015)
  13. Dosovitskiy, A., Fischer, P., Springenberg, J.T., Riedmiller, M., Brox, T.: Discriminative unsupervised feature learning with exemplar convolutional neural networks. IEEE Transactions on Pattern Analysis and Machine Intelligence (2015)
  14. Dwibedi, D., Misra, I., Hebert, M.: Cut, paste and learn: Surprisingly easy synthesis for instance detection. In: Proceedings of the IEEE International Conference on Computer Vision (2017)
  15. Fang, H.S., Sun, J., Wang, R., Gou, M., Li, Y.L., Lu, C.: InstaBoost: Boosting instance segmentation via probability map guided copy-pasting. arXiv preprint arXiv:1908.07801 (2019)
  16. Gidaris, S., Singh, P., Komodakis, N.: Unsupervised representation learning by predicting image rotations. arXiv preprint arXiv:1803.07728 (2018)
  17. Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., Bengio, Y.: Generative adversarial nets. In: Advances in Neural Information Processing Systems (2014)
  18. Gulshan, V., Rother, C., Criminisi, A., Blake, A., Zisserman, A.: Geodesic star convexity for interactive image segmentation. In: 2010 IEEE Computer Society Conference on Computer Vision and Pattern Recognition. IEEE (2010)
  19. Gutmann, M., Hyvarinen, A.: Noise-contrastive estimation: A new estimation principle for unnormalized statistical models. In: Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics (2010)
  20. Han, J., Zhang, D., Hu, X., Guo, L., Ren, J., Wu, F.: Background prior-based salient object detection via deep reconstruction residual. IEEE Transactions on Circuits and Systems for Video Technology (2014)
  21. He, K., Fan, H., Wu, Y., Xie, S., Girshick, R.: Momentum contrast for unsupervised visual representation learning. arXiv preprint arXiv:1911.05722 (2019)
  22. He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (2016)
  23. Jiang, B., Zhang, L., Lu, H., Yang, C., Yang, M.H.: Saliency detection via absorbing Markov chain. In: Proceedings of the IEEE International Conference on Computer Vision (2013)
  24. Jiang, P., Ling, H., Yu, J., Peng, J.: Salient region detection by UFO: Uniqueness, focusness and objectness. In: Proceedings of the IEEE International Conference on Computer Vision (2013)
  25. Krizhevsky, A., Sutskever, I., Hinton, G.E.: ImageNet classification with deep convolutional neural networks. In: Advances in Neural Information Processing Systems (2012)
  26. MIT Media Lab: VisTex texture database. http://vismod.media.mit.edu/vismod/imagery/VisionTexture/vistex.html (1995)
  27. Nguyen, T., Dax, M., Mummadi, C.K., Ngo, N., Nguyen, T.H.P., Lou, Z., Brox, T.: DeepUSPS: Deep robust unsupervised saliency prediction via self-supervision. In: Advances in Neural Information Processing Systems (2019)
  28. Oord, A.v.d., Li, Y., Vinyals, O.: Representation learning with contrastive predictive coding. arXiv preprint arXiv:1807.03748 (2018)
  29. Oquab, M., Bottou, L., Laptev, I., Sivic, J.: Is object localization for free? Weakly-supervised learning with convolutional neural networks. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (2015)
  30. Pathak, D., Girshick, R., Dollar, P., Darrell, T., Hariharan, B.: Learning features by watching objects move. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. pp. 2701–2710 (2017)
  31. Pathak, D., Krahenbuhl, P., Donahue, J., Darrell, T., Efros, A.A.: Context encoders: Feature learning by inpainting. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (2016)
  32. Qin, X., Zhang, Z., Huang, C., Gao, C., Dehghan, M., Jagersand, M.: BASNet: Boundary-aware salient object detection. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (2019)
  33. Ratner, A.J., Ehrenberg, H., Hussain, Z., Dunnmon, J., Re, C.: Learning to compose domain-specific transformations for data augmentation. In: Advances in Neural Information Processing Systems. pp. 3236–3246 (2017)
  34. de Sa, V.R.: Learning classification with unlabeled data. In: Advances in Neural Information Processing Systems (1994)
  35. Selvaraju, R.R., Cogswell, M., Das, A., Vedantam, R., Parikh, D., Batra, D.: Grad-CAM: Visual explanations from deep networks via gradient-based localization. In: Proceedings of the IEEE International Conference on Computer Vision (2017)
  36. Simonyan, K., Vedaldi, A., Zisserman, A.: Deep inside convolutional networks: Visualising image classification models and saliency maps. arXiv preprint arXiv:1312.6034 (2013)
  37. Simonyan, K., Zisserman, A.: Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556 (2014)
  38. Tian, Y., Krishnan, D., Isola, P.: Contrastive multiview coding. arXiv preprint arXiv:1906.05849 (2019)
  39. Torralba, A.: Contextual priming for object detection. International Journal of Computer Vision (2003)
  40. Vincent, P., Larochelle, H., Bengio, Y., Manzagol, P.A.: Extracting and composing robust features with denoising autoencoders. In: Proceedings of the 25th International Conference on Machine Learning. ACM (2008)
  41. Wang, L., Lu, H., Wang, Y., Feng, M., Wang, D., Yin, B., Ruan, X.: Learning to detect salient objects with image-level supervision. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (2017)
  42. Wang, X., He, K., Gupta, A.: Transitive invariance for self-supervised visual representation learning. In: Proceedings of the IEEE International Conference on Computer Vision (2017)
  43. Wei, Y., Wen, F., Zhu, W., Sun, J.: Geodesic saliency using background priors. In: European Conference on Computer Vision. Springer (2012)
  44. Wu, Z., Xiong, Y., Yu, S.X., Lin, D.: Unsupervised feature learning via non-parametric instance discrimination. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (2018)
  45. Yan, Q., Xu, L., Shi, J., Jia, J.: Hierarchical saliency detection. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (2013)
  46. Yang, C., Zhang, L., Lu, H., Ruan, X., Yang, M.H.: Saliency detection via graph-based manifold ranking. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (2013)
  47. Zeiler, M.D., Fergus, R.: Visualizing and understanding convolutional networks. In: European Conference on Computer Vision. Springer (2014)
  48. Zhang, D., Han, J., Zhang, Y.: Supervision by fusion: Towards unsupervised learning of deep salient object detector. In: Proceedings of the IEEE International Conference on Computer Vision (2017)
  49. Zhang, J., Zhang, T., Dai, Y., Harandi, M., Hartley, R.: Deep unsupervised saliency detection: A multiple noisy labeling perspective. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (2018)
  50. Zhang, R., Isola, P., Efros, A.A.: Colorful image colorization. In: European Conference on Computer Vision. Springer (2016)
  51. Zhou, B., Khosla, A., Lapedriza, A., Oliva, A., Torralba, A.: Object detectors emerge in deep scene CNNs. arXiv preprint arXiv:1412.6856 (2014)
  52. Zhou, B., Khosla, A., Lapedriza, A., Oliva, A., Torralba, A.: Learning deep features for discriminative localization. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (2016)
  53. Zhou, B., Lapedriza, A., Xiao, J., Torralba, A., Oliva, A.: Learning deep features for scene recognition using Places database. In: Advances in Neural Information Processing Systems (2014)
  54. Zhu, W., Liang, S., Wei, Y., Sun, J.: Saliency optimization from robust background detection. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (2014)
  55. Zhuang, C., Zhai, A.L., Yamins, D.: Local aggregation for unsupervised learning of visual embeddings. In: Proceedings of the IEEE International Conference on Computer Vision (2019)