Demystifying Contrastive Self-Supervised Learning: Invariances, Augmentations and Dataset Biases

NeurIPS 2020.

Keywords:
category instance, semantic segmentation, aggressive cropping, object recognition, recent work

Abstract:

Self-supervised representation learning approaches have recently surpassed their supervised learning counterparts on downstream tasks like object detection and image classification. Somewhat mysteriously, the recent gains in performance come from training instance classification models, treating each image and its augmented versions as ...

Introduction
  • Inspired by biological agents and necessitated by the manual annotation bottleneck, there has been growing interest in self-supervised visual representation learning.
  • Early work in self-supervised learning focused on using “pretext” tasks for which ground-truth is free and can be procured through an automated process [3, 4].
  • The common theme across recent works is the focus on the instance discrimination task [9] – treating every instance as a class of its own.
  • The contrastive loss [5, 7] has proven to be a useful objective function for instance discrimination, but requires gathering pairs of samples belonging to the same class (see the sketch after this list).
  • Instance discrimination, the contrastive loss and aggressive augmentation are the three key ingredients underlying these new gains
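As a concrete illustration of the contrastive objective referenced above, here is a minimal sketch of an InfoNCE-style instance-discrimination loss in PyTorch. It is a generic sketch, not the exact MOCO/PIRL implementation (which additionally uses a momentum encoder and a queue/memory bank of negatives); the names and the temperature value are illustrative.

```python
import torch
import torch.nn.functional as F

def info_nce_loss(q, k, temperature=0.07):
    """Contrastive instance-discrimination loss over a batch of embedding pairs.

    q, k: (N, D) embeddings of two augmented views of the same N images.
    q[i] is attracted to its positive k[i] and repelled from k[j], j != i.
    """
    q = F.normalize(q, dim=1)
    k = F.normalize(k, dim=1)
    logits = q @ k.t() / temperature                      # (N, N) cosine similarities
    targets = torch.arange(q.size(0), device=q.device)    # positives lie on the diagonal
    return F.cross_entropy(logits, targets)
```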
Highlights
  • Inspired by biological agents and necessitated by the manual annotation bottleneck, there has been growing interest in self-supervised visual representation learning
  • We show that the learned representation outperforms MoCo-v2 [10] trained on the same data in terms of viewpoint invariance, category instance invariance, occlusion invariance and demonstrates improved performance on object recognition tasks
  • PIRL has slightly better occlusion invariance than MOCO, which can be attributed to the more aggressive cropping transformation used by PIRL (a sketch of such a cropping transform follows this list)
  • Since our analysis suggests that aggressive cropping is detrimental, we aim to explore an alternative in order to improve the visual representation learned by MOCOv2
  • We demonstrate that these self-supervised representations learn occlusion invariance by employing an aggressive cropping strategy which heavily relies on an object-centric dataset bias
  • We demonstrate that compared to supervised models, these representations possess inferior viewpoint, illumination direction and category instance invariances
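The "aggressive cropping" discussed in these highlights is typically implemented as a RandomResizedCrop with a low minimum area scale, so that the two views of an image can show small, disjoint parts of the object. Below is a minimal torchvision-based sketch; the crop, size and jitter parameters are illustrative, not the exact MOCO-v2 or PIRL settings.

```python
from torchvision import transforms

# Two independently sampled aggressive crops of the same image form a positive
# pair. With a low minimum scale, each crop may cover only a fragment of the
# object, which effectively simulates occlusion on object-centric datasets.
# (Scale/size and jitter values are illustrative, not the papers' exact settings.)
aggressive_view = transforms.Compose([
    transforms.RandomResizedCrop(224, scale=(0.2, 1.0)),
    transforms.RandomHorizontalFlip(),
    transforms.ColorJitter(0.4, 0.4, 0.4, 0.1),
    transforms.RandomGrayscale(p=0.2),
    transforms.ToTensor(),
])

def make_positive_pair(img):
    """Return two randomly augmented views of the same PIL image."""
    return aggressive_view(img), aggressive_view(img)
```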
Methods
  • (Invariance table headers: Viewpoint, Illumination Dir., Illumination Color, Instance.)
  • The final Top-K Representation Invariance Score (RIS) is computed by averaging the target-conditioned invariance over the top-K neurons and then taking the mean over all targets.
  • The authors convert the Top-K RIS to a percentage of the maximum possible value, i.e. the value obtained when L_y(i) = 1 for every neuron i and every target y ∈ Y (a sketch of this aggregation follows this list).
  • Since the authors wish to study the properties relevant for object recognition tasks, the authors focus on invariances to viewpoint, occlusion, illumination direction, illumination color, instance and a combination of instance and viewpoint changes.
  • The authors describe the datasets used to evaluate these invariances and will publicly release the code to reproduce the invariance evaluation metrics on these datasets
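A minimal sketch of the Top-K RIS aggregation step described above, assuming the per-neuron, per-target invariance values have already been computed following the Goodfellow et al. invariance measure cited in the references; the array names, the default K and the normalisation argument are illustrative.

```python
import numpy as np

def top_k_ris(cond_invariance, max_cond_invariance, k=50):
    """Top-K Representation Invariance Score (RIS), as a percentage.

    cond_invariance: (num_targets, num_neurons) array; entry [y, i] is the
        target-conditioned invariance of neuron i for target y.
    max_cond_invariance: same-shaped array holding the value each entry would
        take in the fully invariant case (L_y(i) = 1 for all neurons and targets).
    """
    def aggregate(inv):
        top_k = np.sort(inv, axis=1)[:, -k:]   # K most invariant neurons per target
        return top_k.mean(axis=1).mean()       # average over neurons, then over targets
    return 100.0 * aggregate(cond_invariance) / aggregate(max_cond_invariance)
```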
Conclusion
  • The aggressive cropping in MOCO and PIRL creates pairs of images that depict parts of objects, thereby simulating occluded objects.
  • The authors do observe that the self-supervised approaches MOCO and PIRL have significantly higher occlusion invariance compared to an Imagenet supervised model.
  • PIRL has slightly better occlusion invariance than MOCO, which can be attributed to the more aggressive cropping transformation used by PIRL.
  • The authors present a framework to evaluate invariances in representations
  • Using this framework, the authors demonstrate that these self-supervised representations learn occlusion invariance by employing an aggressive cropping strategy which heavily relies on an object-centric dataset bias.
  • The authors propose an alternative strategy to improve invariances in these representations by leveraging naturally occurring temporal transformations in videos
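As a minimal sketch of the video-based alternative mentioned in the last bullet, temporally displaced frames of the same video can serve as a naturally transformed positive pair for the contrastive objective, instead of (or in addition to) aggressive synthetic crops; region-level variants would instead pair tracked boxes across frames. The function name and gap parameter below are hypothetical.

```python
import random

def temporal_positive_pair(video_frames, max_gap=30):
    """Sample two frames of one video as a positive pair.

    The temporal gap between the frames provides naturally occurring viewpoint,
    deformation and illumination changes, rather than simulating invariances
    with aggressive cropping. video_frames: an indexable sequence of frames.
    """
    t = random.randrange(len(video_frames))
    gap = random.randint(1, max_gap)
    t2 = min(t + gap, len(video_frames) - 1)
    return video_frames[t], video_frames[t2]
```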
Summary
  • Introduction:

    Inspired by biological agents and necessitated by the manual annotation bottleneck, there has been growing interest in self-supervised visual representation learning.
  • Early work in self-supervised learning focused on using “pretext” tasks for which ground-truth is free and can be procured through an automated process [3, 4].
  • The common theme across recent works is the focus on the instance discrimination task [9] – treating every instance as a class of its own.
  • The contrastive loss [5, 7] has proven to be a useful objective function for instance discrimination, but requires gathering pairs of samples belonging to the same class.
  • Instance discrimination, the contrastive loss and aggressive augmentation are the three key ingredients underlying these new gains
  • Objectives:

    Since the analysis suggests that aggressive cropping is detrimental, the authors aim to explore an alternative in order to improve the visual representation learned by MOCOv2.
  • The goal of this work is to demystify the efficacy of contrastive self-supervised representations on object recognition tasks
  • Methods:

    (Invariance table headers: Viewpoint, Illumination Dir., Illumination Color, Instance.)
  • The final Top-K Representation Invariance Score (RIS) is computed by averaging the target-conditioned invariance over the top-K neurons and then taking the mean over all targets.
  • The authors convert the Top-K RIS to a percentage of the maximum possible value, i.e. the value obtained when L_y(i) = 1 for every neuron i and every target y ∈ Y.
  • Since the authors wish to study the properties relevant for object recognition tasks, the authors focus on invariances to viewpoint, occlusion, illumination direction, illumination color, instance and a combination of instance and viewpoint changes.
  • The authors describe the datasets used to evaluate these invariances and will publicly release the code to reproduce the invariance evaluation metrics on these datasets
  • Conclusion:

    The aggressive cropping in MOCO and PIRL creates pairs of images that depict parts of objects, thereby simulating occluded objects.
  • The authors do observe that the self-supervised approaches MOCO and PIRL have significantly higher occlusion invariance compared to an Imagenet supervised model.
  • PIRL has slightly better occlusion invariance than MOCO, which can be attributed to the more aggressive cropping transformation used by PIRL.
  • The authors present a framework to evaluate invariances in representations
  • Using this framework, the authors demonstrate that these self-supervised representations learn occlusion invariance by employing an aggressive cropping strategy which heavily relies on an object-centric dataset bias.
  • The authors propose an alternative strategy to improve invariances in these representations by leveraging naturally occurring temporal transformations in videos
Tables
  • Table 1: Invariances learned from Imagenet: We compare invariances encoded in supervised and self-supervised representations learned on the Imagenet dataset. We consider invariances that are useful for object recognition tasks. See text for details about the datasets used. We observe that compared to the supervised model, the contrastive self-supervised approaches are better only at occlusion invariance
  • Table 2: Discriminative power of representations: We compare representations trained on different datasets, in supervised and self-supervised settings, on the task of image classification (a generic linear-probe sketch follows these table descriptions). We observe that representations trained on object-centric datasets, like Imagenet and cropped boxes from MSCOCO, are better at discriminating objects. We also demonstrate that the standard classification setting of Pascal VOC is not an ideal testbed for self-supervised representations since it does not test the ability to discriminate frequently co-occurring objects
  • Table 3: Evaluating Video representations: We evaluate our proposed approach to learn representations by leveraging temporal transformations in the contrastive learning framework. We observe that leveraging frame-level and region-level temporal transformations improves the discriminative power of the representations. We present results on four datasets: Pascal, Pascal Cropped Boxes, Imagenet (image classification) and ADE20K (semantic segmentation)
  • Table 4: Invariances of Video representations: We evaluate the invariances in the representations learned by our proposed approach that leverages frame-level (row 2) and region-level (rows 3, 4) temporal transformations. We observe that, compared to the baseline MOCOv2 model, the models that leverage temporal transformations demonstrate higher viewpoint invariance, illumination invariance, category instance invariance and instance+viewpoint invariance
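The image-classification evaluations summarised in Tables 2 and 3 measure the discriminative power of frozen representations; a common way to do this is a linear probe on extracted features. The sketch below uses scikit-learn and is a generic illustration of that protocol, not necessarily the paper's exact setup (multi-label benchmarks such as Pascal VOC are usually evaluated with per-class binary classifiers and mAP instead).

```python
from sklearn.linear_model import LogisticRegression

def linear_probe_accuracy(train_feats, train_labels, test_feats, test_labels):
    """Fit a linear classifier on frozen features and report test accuracy.

    train_feats, test_feats: (N, D) feature arrays extracted from a frozen backbone.
    """
    clf = LogisticRegression(max_iter=1000)
    clf.fit(train_feats, train_labels)
    return clf.score(test_feats, test_labels)
```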
Related work
  • A large body of research in Computer Vision is dedicated to training feature extraction models, particularly deep neural networks, without the use of human-annotated data. These learned representations are intended to be useful for a wide range of downstream tasks. Research in this domain can be coarsely classified into generative modeling [12, 13, 14, 15, 16, 17] and self-supervised representation learning [3, 4, 18, 19].

    Pretext Tasks: Self-supervised learning involves training deep neural networks by constructing “pretext” tasks for which data can be automatically gathered without human intervention. Numerous such pretext tasks have been proposed in recent literature, including predicting the relative location of patches in images [3], learning to match tracked patches [4], predicting the angle of rotation in an artificially rotated image [19], predicting the colors in a grayscale image [6] and filling in missing parts of images [20]. These tasks are manually designed by experts to ensure that the learned representations are useful for downstream tasks like object detection, image classification and semantic segmentation. However, the intuitions behind the design are generally not verified experimentally due to the lack of a proper evaluation framework beyond the metrics of the downstream tasks. While we do not study these methods in our work, our proposed framework to understand representations (Section 4) can directly be applied to any representation. In many cases, it can be used to verify the motivations for the pretext tasks.
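To make concrete how such pretext tasks procure ground truth without human labels, here is a minimal sketch of automatic label generation for the rotation-prediction task [19]; it is an illustration, not the original implementation.

```python
import random
from torchvision.transforms import functional as TF

def rotation_pretext_sample(img):
    """Create a (transformed image, label) training pair with no human annotation.

    The image is rotated by a random multiple of 90 degrees and the rotation
    index (0, 1, 2, 3 for 0, 90, 180, 270 degrees) serves as the free label.
    """
    label = random.randrange(4)
    return TF.rotate(img, angle=90 * label), label
```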
References
  • [1] K. He, H. Fan, Y. Wu, S. Xie, and R. Girshick, “Momentum contrast for unsupervised visual representation learning,” arXiv preprint arXiv:1911.05722, 2019.
  • [2] I. Misra and L. van der Maaten, “Self-supervised learning of pretext-invariant representations,” arXiv preprint arXiv:1912.01991, 2019.
  • [3] C. Doersch, A. Gupta, and A. A. Efros, “Unsupervised visual representation learning by context prediction,” in Proceedings of the IEEE International Conference on Computer Vision, pp. 1422–1430, 2015.
  • [4] X. Wang and A. Gupta, “Unsupervised learning of visual representations using videos,” in Proceedings of the IEEE International Conference on Computer Vision, pp. 2794–2802, 2015.
  • [5] A. v. d. Oord, Y. Li, and O. Vinyals, “Representation learning with contrastive predictive coding,” arXiv preprint arXiv:1807.03748, 2018.
  • [6] R. Zhang, P. Isola, and A. A. Efros, “Colorful image colorization,” in European Conference on Computer Vision, pp. 649–666, Springer, 2016.
  • [7] O. J. Hénaff, A. Srinivas, J. De Fauw, A. Razavi, C. Doersch, S. Eslami, and A. v. d. Oord, “Data-efficient image recognition with contrastive predictive coding,” arXiv preprint arXiv:1905.09272, 2019.
  • [8] T. Chen, S. Kornblith, M. Norouzi, and G. Hinton, “A simple framework for contrastive learning of visual representations,” arXiv preprint arXiv:2002.05709, 2020.
  • [9] A. Dosovitskiy, J. T. Springenberg, M. Riedmiller, and T. Brox, “Discriminative unsupervised feature learning with convolutional neural networks,” in Advances in Neural Information Processing Systems, pp. 766–774, 2014.
  • [10] X. Chen, H. Fan, R. Girshick, and K. He, “Improved baselines with momentum contrastive learning,” arXiv preprint arXiv:2003.04297, 2020.
  • [11] Z. Wu, Y. Xiong, S. X. Yu, and D. Lin, “Unsupervised feature learning via non-parametric instance discrimination,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 3733–3742, 2018.
  • [12] Y. Tang, R. Salakhutdinov, and G. Hinton, “Robust Boltzmann machines for recognition and denoising,” in 2012 IEEE Conference on Computer Vision and Pattern Recognition, pp. 2264–2271, IEEE, 2012.
  • [13] H. Lee, R. Grosse, R. Ranganath, and A. Y. Ng, “Convolutional deep belief networks for scalable unsupervised learning of hierarchical representations,” in Proceedings of the 26th Annual International Conference on Machine Learning, pp. 609–616, 2009.
  • [14] P. Vincent, H. Larochelle, Y. Bengio, and P.-A. Manzagol, “Extracting and composing robust features with denoising autoencoders,” in Proceedings of the 25th International Conference on Machine Learning, pp. 1096–1103, 2008.
  • [15] A. Makhzani, J. Shlens, N. Jaitly, I. Goodfellow, and B. Frey, “Adversarial autoencoders,” arXiv preprint arXiv:1511.05644, 2015.
  • [16] D. P. Kingma and M. Welling, “Auto-encoding variational Bayes,” arXiv preprint arXiv:1312.6114, 2013.
  • [17] C. Doersch, “Tutorial on variational autoencoders,” arXiv preprint arXiv:1606.05908, 2016.
  • [18] C. Doersch and A. Zisserman, “Multi-task self-supervised visual learning,” in Proceedings of the IEEE International Conference on Computer Vision, pp. 2051–2060, 2017.
  • [19] S. Gidaris, P. Singh, and N. Komodakis, “Unsupervised representation learning by predicting image rotations,” arXiv preprint arXiv:1803.07728, 2018.
  • [20] D. Pathak, P. Krahenbuhl, J. Donahue, T. Darrell, and A. A. Efros, “Context encoders: Feature learning by inpainting,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2536–2544, 2016.
  • [21] Y. Tian, D. Krishnan, and P. Isola, “Contrastive multiview coding,” arXiv preprint arXiv:1906.05849, 2019.
  • [22] X. Wang, K. He, and A. Gupta, “Transitive invariance for self-supervised visual representation learning,” in Proceedings of the IEEE International Conference on Computer Vision, pp. 1329–1338, 2017.
  • [23] X. Wang, A. Jabri, and A. A. Efros, “Learning correspondence from the cycle-consistency of time,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2566–2576, 2019.
  • [24] D. Pathak, R. Girshick, P. Dollár, T. Darrell, and B. Hariharan, “Learning features by watching objects move,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2701–2710, 2017.
  • [25] P. Goyal, D. Mahajan, A. Gupta, and I. Misra, “Scaling and benchmarking self-supervised visual representation learning,” in Proceedings of the IEEE International Conference on Computer Vision, pp. 6391–6400, 2019.
  • [26] I. Goodfellow, H. Lee, Q. V. Le, A. Saxe, and A. Y. Ng, “Measuring invariances in deep networks,” in Advances in Neural Information Processing Systems, pp. 646–654, 2009.
  • [27] D. Bau, J.-Y. Zhu, H. Strobelt, B. Zhou, J. B. Tenenbaum, W. T. Freeman, and A. Torralba, “Visualizing and understanding generative adversarial networks,” arXiv preprint arXiv:1901.09887, 2019.
  • [28] B. Zhou, D. Bau, A. Oliva, and A. Torralba, “Comparing the interpretability of deep networks via network dissection,” in Explainable AI: Interpreting, Explaining and Visualizing Deep Learning, pp. 243–252, Springer, 2019.
  • [29] R. R. Selvaraju, M. Cogswell, A. Das, R. Vedantam, D. Parikh, and D. Batra, “Grad-CAM: Visual explanations from deep networks via gradient-based localization,” in Proceedings of the IEEE International Conference on Computer Vision, pp. 618–626, 2017.
  • [30] R. R. Selvaraju, A. Das, R. Vedantam, M. Cogswell, D. Parikh, and D. Batra, “Grad-CAM: Why did you say that?,” arXiv preprint arXiv:1611.07450, 2016.
  • [31] Y. Tian, C. Sun, B. Poole, D. Krishnan, C. Schmid, and P. Isola, “What makes for good views for contrastive learning,” arXiv preprint arXiv:2005.10243, 2020.
  • [32] T. Wang and P. Isola, “Understanding contrastive representation learning through alignment and uniformity on the hypersphere,” arXiv preprint arXiv:2005.10242, 2020.
  • [33] L. Huang, X. Zhao, and K. Huang, “GOT-10k: A large high-diversity benchmark for generic object tracking in the wild,” IEEE Transactions on Pattern Analysis and Machine Intelligence, 2019.
  • [34] Y. Xiang, R. Mottaghi, and S. Savarese, “Beyond PASCAL: A benchmark for 3D object detection in the wild,” in IEEE Winter Conference on Applications of Computer Vision, pp. 75–82, IEEE, 2014.
  • [35] J.-M. Geusebroek, G. J. Burghouts, and A. W. Smeulders, “The Amsterdam Library of Object Images,” International Journal of Computer Vision, vol. 61, no. 1, pp. 103–112, 2005.
  • [36] B. Thomee, D. A. Shamma, G. Friedland, B. Elizalde, K. Ni, D. Poland, D. Borth, and L.-J. Li, “YFCC100M: The new data in multimedia research,” Communications of the ACM, vol. 59, no. 2, pp. 64–73, 2016.
  • [37] M. Everingham, L. Van Gool, C. K. I. Williams, J. Winn, and A. Zisserman, “The PASCAL Visual Object Classes Challenge 2007 (VOC2007) Results.” http://www.pascalnetwork.org/challenges/VOC/voc2007/workshop/index.html.
  • [38] T.-Y. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár, and C. L. Zitnick, “Microsoft COCO: Common objects in context,” in European Conference on Computer Vision, pp. 740–755, Springer, 2014.
  • [39] M. Muller, A. Bibi, S. Giancola, S. Alsubaihi, and B. Ghanem, “TrackingNet: A large-scale dataset and benchmark for object tracking in the wild,” in Proceedings of the European Conference on Computer Vision (ECCV), pp. 300–317, 2018.
  • [40] J. R. Uijlings, K. E. Van De Sande, T. Gevers, and A. W. Smeulders, “Selective search for object recognition,” International Journal of Computer Vision, vol. 104, no. 2, pp. 154–171, 2013.
  • [41] R. Girshick, “Fast R-CNN,” in Proceedings of the IEEE International Conference on Computer Vision, pp. 1440–1448, 2015.
  • [42] B. Zhou, H. Zhao, X. Puig, S. Fidler, A. Barriuso, and A. Torralba, “Scene parsing through ADE20K dataset,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 633–641, 2017.
  • [43] J. Long, E. Shelhamer, and T. Darrell, “Fully convolutional networks for semantic segmentation,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 3431–3440, 2015.