Revisiting Self-Supervised Visual Representation Learning

CVPR, 2019.

Keywords:
unsupervised visual representation, multi-layer perceptron, large scale, pretext task, stochastic gradient descent

Abstract:

Unsupervised visual representation learning remains a largely unsolved problem in computer vision research. Among a big body of recently proposed approaches for unsupervised learning of visual representations, a class of self-supervised techniques achieves superior performance on many challenging benchmarks. A large number of the pretext tasks for self-supervised learning have been studied, but other important aspects, such as the choice of convolutional neural network (CNN) architecture, have not received equal attention. Therefore, we revisit numerous previously proposed self-supervised models, conduct a thorough large-scale study and, as a result, uncover multiple crucial insights. We challenge a number of common practices in self-supervised visual representation learning and observe that standard recipes for CNN design do not always translate to self-supervised representation learning. As part of our study, we drastically boost the performance of previously proposed techniques and outperform previously published state-of-the-art results by a large margin.

Code:

    https://github.com/google/revisiting-self-supervised

Data:

    ImageNet (ILSVRC-2012) for training; ImageNet and Places205 for evaluation
Introduction
  • Automated computer vision systems have recently made drastic progress. Many models for tackling challenging tasks such as object recognition, semantic segmentation or object detection can compete with humans on complex visual benchmarks [15, 45, 14].
  • The success of such systems hinges on a large amount of labeled data, which is not always available and often prohibitively expensive to acquire.
  • These systems are tailored to specific scenarios, e.g., a model trained on the ImageNet (ILSVRC-2012) dataset [38] can only recognize its 1000 semantic categories, and a model trained to perceive road traffic in daylight may not work in darkness [5, 4].
Highlights
  • Automated computer vision systems have recently made drastic progress
  • We propose to take a closer look at convolutional neural network (CNN) architectures
  • Our work is complementary to the previously discussed methods, which introduce new pretext tasks, since we show how existing self-supervision methods can significantly benefit from our insights
  • We have investigated self-supervised visual representation learning from previously unexplored angles
  • We uncovered multiple important insights, namely that (1) lessons from architecture design in the fully-supervised setting do not necessarily translate to the self-supervised setting; (2) contrary to previously popular architectures like AlexNet, in residual architectures the final pre-logits layer consistently yields the best performance; (3) the widening factor of CNNs has a drastic effect on the performance of self-supervised techniques; and (4) stochastic gradient descent (SGD) training of linear logistic regression may require a very long time to converge (a minimal sketch of this linear evaluation protocol follows these highlights)
  • As a result of selecting the right architecture for each self-supervision task and increasing the widening factor, our models significantly outperform previously reported results
  • However, we reveal that neither the ranking of architectures is consistent across different methods, nor the ranking of methods consistent across architectures. This implies that pretext tasks for self-supervised learning should not be considered in isolation, but in conjunction with underlying architectures
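
As a rough illustration of points (2) and (4), here is a minimal sketch of the linear evaluation protocol, written in PyTorch as an assumption for illustration only (the paper's actual pipeline, framework, and hyperparameters differ): a linear logistic regression head is trained with plain SGD on top of frozen pre-logits features.

    import torch
    import torch.nn as nn
    import torchvision.models as models

    # Backbone standing in for a network pre-trained on a pretext task.
    backbone = models.resnet50()
    backbone.fc = nn.Identity()       # expose the 2048-d pre-logits features
    for p in backbone.parameters():
        p.requires_grad = False       # freeze: only the linear head is trained
    backbone.eval()

    head = nn.Linear(2048, 1000)      # linear logistic regression over classes

    # Plain SGD; as noted above, such training can take a very long time
    # to converge, so long schedules with learning-rate decays are needed.
    optimizer = torch.optim.SGD(head.parameters(), lr=0.1, momentum=0.9)
    criterion = nn.CrossEntropyLoss()

    def train_step(images, labels):
        with torch.no_grad():
            feats = backbone(images)  # frozen representation, no gradients
        logits = head(feats)
        loss = criterion(logits, labels)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        return loss.item()

The key design point is that the backbone is frozen and in eval mode, so representation quality alone determines the accuracy of the linear head.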
Results
  • The authors present and interpret the results of their large-scale study.
  • All self-supervised models are trained on ImageNet and evaluated on the authors' own hold-out validation splits of ImageNet and Places205.
  • In Table 2, where the authors compare to results from the prior literature, the official ImageNet and Places205 validation splits are used.
Conclusion
  • The authors have investigated self-supervised visual representation learning from previously unexplored angles.
  • However, they reveal that neither the ranking of architectures is consistent across different methods, nor the ranking of methods consistent across architectures.
  • This implies that pretext tasks for self-supervised learning should not be considered in isolation, but in conjunction with underlying architectures.
Tables
  • Table 1: Evaluation of representations from self-supervised techniques based on various CNN architectures. The scores are accuracies (in %) of a linear logistic regression model trained on top of these representations using the ImageNet training split. Our validation split is used for computing accuracies. Architectures marked with "(-)" are slight variations described in Section 3.1. Sub-columns such as 4× correspond to widening factors (a toy sketch of widening follows these captions). Top-performing architectures in a column are bold; the best pretext task for each model is underlined.
  • Table 2: Comparison of the published self-supervised models to our best models. The scores correspond to the accuracy of a linear logistic regression model trained on top of representations provided by self-supervised models. The official validation splits of ImageNet and Places205 are used for computing accuracies. The "Family" column shows which basic model architecture was used in the referenced literature: AlexNet, VGG-style, or Residual.
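
To make the widening factors in Table 1 concrete, below is a toy sketch in the spirit of Wide ResNets [46]: every layer's channel count is multiplied by a factor k, so a 4× model has the same depth but four times the channels (and roughly 16× the parameters in those layers). The widened_stem helper is purely illustrative, not one of the paper's actual ResNet/RevNet variants.

    import torch.nn as nn

    def conv_bn_relu(in_ch, out_ch):
        return nn.Sequential(
            nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=1, bias=False),
            nn.BatchNorm2d(out_ch),
            nn.ReLU(inplace=True),
        )

    def widened_stem(k=1, base=(16, 32, 64)):
        """Toy CNN whose channel widths are all scaled by the widening factor k."""
        widths = [k * w for w in base]
        layers, in_ch = [], 3
        for w in widths:
            layers.append(conv_bn_relu(in_ch, w))
            in_ch = w
        return nn.Sequential(*layers)

    # The "4x" sub-columns of Table 1 correspond to k = 4: same depth,
    # wider layers, hence a higher-dimensional learned representation.
    model_1x = widened_stem(k=1)
    model_4x = widened_stem(k=4)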
Related work
  • Self-supervision is a learning framework in which a supervised signal for a pretext task is created automatically, in an effort to learn representations that are useful for solving real-world downstream tasks. Being a generic framework, self-supervision enjoys a wide number of applications, ranging from robotics to image understanding.

    In robotics, both the result of interacting with the world, and the fact that multiple perception modalities simultaneously get sensory inputs are strong signals which can be exploited to create self-supervised tasks [21, 41, 26, 10].

    Similarly, when learning representation from videos, one can either make use of the synchronized cross-modality stream of audio, video, and potentially subtitles [35, 39, 23, 44], or of the consistency in the temporal dimension [41].

    In this paper we focus on self-supervised techniques that learn from image databases. These techniques have demonstrated impressive results for learning high-level image representations. Inspired by unsupervised methods from the natural language processing domain, which rely on predicting words from their context [28], Doersch et al. [7] proposed a practically successful pretext task of predicting the relative location of image patches. This work spawned a line of patch-based self-supervised visual representation learning methods, including a model from [31] that predicts the permutation of a "jigsaw puzzle" created from the full image, and recent follow-ups [29, 33] (a toy sketch of the patch-location task follows).
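
As a concrete example of a patch-based pretext task, the following toy sketch generates training pairs for relative patch location in the spirit of Doersch et al. [7]: sample a central patch and one of its eight grid neighbors, and use the neighbor's position as an 8-way classification label. The helper name and the patch/gap sizes are illustrative assumptions, not the paper's exact preprocessing.

    import random
    import torch

    def relative_patch_location_pair(image, patch=96, gap=16):
        """image: a CHW tensor at least 3*patch + 2*gap pixels on each side.

        Returns (center, neighbor, label), where label in {0, ..., 7}
        encodes which of the 8 grid neighbors was sampled."""
        offsets = [(-1, -1), (-1, 0), (-1, 1),
                   ( 0, -1),          ( 0, 1),
                   ( 1, -1), ( 1, 0), ( 1, 1)]
        label = random.randrange(8)
        dy, dx = offsets[label]
        step = patch + gap                      # corner-to-corner distance
        _, h, w = image.shape
        # Place the central patch so every neighbor stays inside the image.
        y0 = random.randint(step, h - patch - step)
        x0 = random.randint(step, w - patch - step)
        y1, x1 = y0 + dy * step, x0 + dx * step
        center = image[:, y0:y0 + patch, x0:x0 + patch]
        neighbor = image[:, y1:y1 + patch, x1:x1 + patch]
        return center, neighbor, label

    # Example: a pair from a random 320x320 image (the minimum size here).
    img = torch.rand(3, 320, 320)
    center, neighbor, label = relative_patch_location_pair(img)

A network then embeds both patches and classifies their combined embeddings into one of the eight relative positions; solving this task well requires recognizing object parts and their spatial layout.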
References
  • J. Behrmann, D. Duvenaud, and J.-H. Jacobsen. Invertible residual networks. arXiv preprint arXiv:1811.00995, 2018.
  • M. Caron, P. Bojanowski, A. Joulin, and M. Douze. Deep clustering for unsupervised learning of visual features. In European Conference on Computer Vision (ECCV), 2018.
  • T. Chen, X. Zhai, M. Ritter, M. Lucic, and N. Houlsby. Self-supervised generative adversarial networks. In Conference on Computer Vision and Pattern Recognition (CVPR), 2019.
  • Y. Chen, W. Li, C. Sakaridis, D. Dai, and L. Van Gool. Domain adaptive Faster R-CNN for object detection in the wild. In Conference on Computer Vision and Pattern Recognition (CVPR), 2018.
  • D. Dai and L. Van Gool. Dark model adaptation: Semantic image segmentation from daytime to nighttime. arXiv preprint arXiv:1810.02575, 2018.
  • L. Dinh, J. Sohl-Dickstein, and S. Bengio. Density estimation using Real NVP. In International Conference on Learning Representations (ICLR), 2017.
  • C. Doersch, A. Gupta, and A. A. Efros. Unsupervised visual representation learning by context prediction. In International Conference on Computer Vision (ICCV), 2015.
  • C. Doersch and A. Zisserman. Multi-task self-supervised visual learning. In International Conference on Computer Vision (ICCV), 2017.
  • A. Dosovitskiy, J. T. Springenberg, M. Riedmiller, and T. Brox. Discriminative unsupervised feature learning with convolutional neural networks. In Advances in Neural Information Processing Systems (NIPS), 2014.
  • F. Ebert, S. Dasari, A. X. Lee, S. Levine, and C. Finn. Robustness via retrying: Closed-loop robotic manipulation with self-supervised learning. In Conference on Robot Learning (CoRL), 2018.
  • S. Gidaris, P. Singh, and N. Komodakis. Unsupervised representation learning by predicting image rotations. In International Conference on Learning Representations (ICLR), 2018.
  • A. N. Gomez, M. Ren, R. Urtasun, and R. B. Grosse. The reversible residual network: Backpropagation without storing activations. In Advances in Neural Information Processing Systems (NIPS), 2017.
  • I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio. Generative adversarial nets. In Advances in Neural Information Processing Systems (NIPS), 2014.
  • K. He, G. Gkioxari, P. Dollár, and R. Girshick. Mask R-CNN. In International Conference on Computer Vision (ICCV), 2017.
  • K. He, X. Zhang, S. Ren, and J. Sun. Delving deep into rectifiers: Surpassing human-level performance on ImageNet classification. In International Conference on Computer Vision (ICCV), pages 1026–1034, 2015.
  • K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In Conference on Computer Vision and Pattern Recognition (CVPR), 2016.
  • K. He, X. Zhang, S. Ren, and J. Sun. Identity mappings in deep residual networks. In European Conference on Computer Vision (ECCV), 2016.
  • A. Hermans, L. Beyer, and B. Leibe. In defense of the triplet loss for person re-identification. arXiv preprint arXiv:1703.07737, 2017.
  • S. Ioffe and C. Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. In International Conference on Machine Learning (ICML), 2015.
  • J. Jacobsen, A. W. M. Smeulders, and E. Oyallon. i-RevNet: Deep invertible networks. In International Conference on Learning Representations (ICLR), 2018.
  • E. Jang, C. Devin, V. Vanhoucke, and S. Levine. Grasp2Vec: Learning object representations from self-supervised grasping. In Conference on Robot Learning (CoRL), 2018.
  • D. Kim, D. Cho, D. Yoo, and I. S. Kweon. Learning image representations by completing damaged jigsaw puzzles. In Winter Conference on Applications of Computer Vision (WACV), 2018.
  • B. Korbar, D. Tran, and L. Torresani. Cooperative learning of audio and video models from self-supervised synchronization. arXiv preprint arXiv:1807.00230, 2018.
  • A. Krizhevsky, I. Sutskever, and G. E. Hinton. ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems (NIPS), 2012.
  • A. Krizhevsky, I. Sutskever, and G. E. Hinton. ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems (NIPS), 2012.
  • M. A. Lee, Y. Zhu, K. Srinivasan, P. Shah, S. Savarese, L. Fei-Fei, A. Garg, and J. Bohg. Making sense of vision and touch: Self-supervised learning of multimodal representations for contact-rich tasks. arXiv preprint arXiv:1810.10191, 2018.
  • D. C. Liu and J. Nocedal. On the limited memory BFGS method for large scale optimization. Mathematical Programming, 45(1-3):503–528, 1989.
  • T. Mikolov, K. Chen, G. Corrado, and J. Dean. Efficient estimation of word representations in vector space. arXiv preprint arXiv:1301.3781, 2013.
  • T. N. Mundhenk, D. Ho, and B. Y. Chen. Improvements to context based self-supervised learning. In Conference on Computer Vision and Pattern Recognition (CVPR), 2018.
  • V. Nair and G. E. Hinton. Rectified linear units improve restricted Boltzmann machines. In International Conference on Machine Learning (ICML), 2010.
  • M. Noroozi and P. Favaro. Unsupervised learning of visual representations by solving jigsaw puzzles. In European Conference on Computer Vision (ECCV), 2016.
  • M. Noroozi, H. Pirsiavash, and P. Favaro. Representation learning by learning to count. In International Conference on Computer Vision (ICCV), 2017.
  • M. Noroozi, A. Vinjimoor, P. Favaro, and H. Pirsiavash. Boosting self-supervised learning via knowledge transfer. In Conference on Computer Vision and Pattern Recognition (CVPR), 2018.
  • A. van den Oord, Y. Li, and O. Vinyals. Representation learning with contrastive predictive coding. arXiv preprint arXiv:1807.03748, 2018.
  • A. Owens and A. A. Efros. Audio-visual scene analysis with self-supervised multisensory features. In European Conference on Computer Vision (ECCV), 2018.
  • D. Pathak, R. B. Girshick, P. Dollár, T. Darrell, and B. Hariharan. Learning features by watching objects move. In Conference on Computer Vision and Pattern Recognition (CVPR), 2017.
  • D. Pathak, P. Krähenbühl, J. Donahue, T. Darrell, and A. Efros. Context encoders: Feature learning by inpainting. In Conference on Computer Vision and Pattern Recognition (CVPR), 2016.
  • O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein, et al. ImageNet large scale visual recognition challenge. International Journal of Computer Vision (IJCV), 115(3):211–252, 2015.
  • N. Sayed, B. Brattoli, and B. Ommer. Cross and learn: Cross-modal self-supervision. arXiv preprint arXiv:1811.03879, 2018.
  • F. Schroff, D. Kalenichenko, and J. Philbin. FaceNet: A unified embedding for face recognition and clustering. In Conference on Computer Vision and Pattern Recognition (CVPR), 2015.
  • P. Sermanet, C. Lynch, Y. Chebotar, J. Hsu, E. Jang, S. Schaal, and S. Levine. Time-contrastive networks: Self-supervised learning from video. arXiv preprint arXiv:1704.06888, 2017.
  • K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556, 2014.
  • C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and A. Rabinovich. Going deeper with convolutions. In Conference on Computer Vision and Pattern Recognition (CVPR), 2015.
  • O. Wiles, A. Koepke, and A. Zisserman. Self-supervised learning of a facial attribute embedding from video. In British Machine Vision Conference (BMVC), 2018.
  • S. Xie, R. Girshick, P. Dollár, Z. Tu, and K. He. Aggregated residual transformations for deep neural networks. In Conference on Computer Vision and Pattern Recognition (CVPR), 2017.
  • S. Zagoruyko and N. Komodakis. Wide residual networks. In British Machine Vision Conference (BMVC), 2016.
  • R. Zhang, P. Isola, and A. A. Efros. Colorful image colorization. In European Conference on Computer Vision (ECCV), 2016.
  • R. Zhang, P. Isola, and A. A. Efros. Split-brain autoencoders: Unsupervised learning by cross-channel prediction. In Conference on Computer Vision and Pattern Recognition (CVPR), 2017.
  • B. Zhou, A. Lapedriza, J. Xiao, A. Torralba, and A. Oliva. Learning deep features for scene recognition using places database. In Advances in Neural Information Processing Systems (NIPS), 2014.