Scaling and Benchmarking Self-Supervised Visual Representation Learning

ICCV, pp. 6390-6399, 2019.

Keywords:
surface normal estimation, convolutional neural network, large scale, data size, unsupervised learning

Abstract:

Self-supervised learning aims to learn representations from the data itself without explicit manual supervision. Existing efforts ignore a crucial aspect of self-supervised learning: the ability to scale to large amounts of data because self-supervision requires no manual labels. In this work, we revisit this principle and scale two popular self-supervised approaches …

Introduction
  • Computer vision has been revolutionized by high capacity Convolutional Neural Networks (ConvNets) [43] and large-scale labeled data (e.g., ImageNet [12]).
  • Even at that scale, performance increases only log-linearly with the amount of labeled data.
  • What has worked for computer vision in the last five years has become a bottleneck: the size, quality, and availability of supervised data.
  • One alternative to overcome this bottleneck is to use the self-supervised learning paradigm.
Highlights
  • Computer vision has been revolutionized by high capacity Convolutional Neural Networks (ConvNets) [43] and large-scale labeled data (e.g., ImageNet [12])
  • Our results show that by scaling along the three axes, self-supervised learning can outperform ImageNet supervised pre-training using the same evaluation setup on the non-semantic tasks of Surface Normal Estimation and Navigation
  • The agent is spawned at random locations and must build a contextual map in order to be successful at the task
  • We studied the effect of scaling two self-supervised approaches along three axes: data size, model capacity and problem complexity
  • Our results indicate that transfer performance increases log-linearly with the data size
  • We believe future work should focus on designing tasks that are complex enough to exploit large scale data and increased model capacity
Methods
  • Models compared: ResNet-50 supervised on ImageNet-1k, ResNet-50 supervised on Places205, and ResNet-50 Jigsaw pre-trained on ImageNet-1k, ImageNet-22k, and YFCC-100M (§6.3).
  • The agent is spawned at random locations and must build a contextual map in order to be successful at the task.
  • Setup: The authors use the setup from [64], which trains an agent with reinforcement learning (PPO [65]) in the Gibson environment [78].
  • The agent uses fixed feature representations from a frozen ConvNet for this task and only the policy network is updated (a minimal sketch of this setup is given below).
  • Models evaluated: ResNet-50 supervised on ImageNet-1k and ResNet-50 Jigsaw pre-trained on ImageNet-1k, ImageNet-22k, and YFCC-100M.
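The following is a minimal sketch, not the authors' released code, of the frozen-feature navigation setup described above: a ResNet-50 trunk provides fixed representations and only a small policy/value head is optimized. The feature dimension, hidden size, action count, and learning rate are illustrative assumptions, and the PPO update itself [65] is omitted.

```python
import torch
import torch.nn as nn
import torchvision.models as models

# Frozen feature extractor: a ResNet-50 trunk whose weights are never updated.
# Loading self-supervised (e.g., Jigsaw) weights rather than random/supervised ones
# is assumed to happen elsewhere; only the frozen-trunk / trainable-policy split matters here.
backbone = models.resnet50()
backbone.fc = nn.Identity()                     # expose the 2048-d pooled features
for p in backbone.parameters():
    p.requires_grad = False
backbone.eval()

# Small trainable policy (actor) and value (critic) heads for PPO.
# Hidden size (512), action count (4), and learning rate are illustrative.
policy = nn.Sequential(nn.Linear(2048, 512), nn.ReLU(), nn.Linear(512, 4))
value = nn.Sequential(nn.Linear(2048, 512), nn.ReLU(), nn.Linear(512, 1))
optimizer = torch.optim.Adam(
    list(policy.parameters()) + list(value.parameters()), lr=2.5e-4)

def act(obs_batch):
    """obs_batch: float tensor of shape (N, 3, 224, 224)."""
    with torch.no_grad():                       # the ConvNet is a fixed featurizer
        feats = backbone(obs_batch)
    return torch.distributions.Categorical(logits=policy(feats)), value(feats)
```

Because gradients never reach the backbone, this protocol measures the quality of the pre-trained representation rather than what the policy network can learn on its own.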
Conclusion
  • The authors studied the effect of scaling two self-supervised approaches along three axes: data size, model capacity and problem complexity.
  • The authors' results indicate that transfer performance increases log-linearly with the data size.
  • The quality of the representations improves with higher capacity models and problem complexity.
  • The authors propose a benchmark suite of 9 diverse tasks to evaluate the quality of the learned representations.
  • The authors believe future work should focus on designing tasks that are complex enough to exploit large scale data and increased model capacity.
  • The authors' experiments suggest that scaling self-supervision is crucial, but there is still a long way to go before definitively surpassing supervised pre-training.
Tables
  • Table1: Table 1
  • Table2: A list of self-supervised pre-training datasets used in this work. We train AlexNet [39] and ResNet-50 [31] on these datasets
  • Table3: ResNet-50 top-1 center-crop accuracy for linear classification on Places205 dataset (§ 6.1). Numbers with use a different fine-tuning procedure. All other models follow the setup from Zhang et al [80]
  • Table4: AlexNet top-1 center-crop accuracy for linear classification on Places205 dataset (§ 6.1). Numbers for [52, 79] are from [80]. Numbers with use a different fine-tuning schedule
  • Table5: ResNet-50 Linear SVMs mAP on VOC07 classification (§ 6.1). A minimal sketch of this linear-SVM evaluation protocol is given after the tables list
  • Table6: Detection mAP for frozen conv body on VOC07 and VOC07+12 using Fast R-CNN with ResNet-50-C4 (mean and std computed over 5 trials). We freeze the conv body for all models. Numbers with ∗ use the Detectron [28] default training schedule. All other models use a slightly longer training schedule (see § 6.4)
  • Table7: Surface Normal Estimation on the NYUv2 dataset. We train ResNet-50 from res5 onwards and freeze the conv body below (§ 6.5)
  • Table8: Detection mAP for full fine-tuning on VOC07 and VOC07+12 using Fast R-CNN with ResNet-50-C4 (mean and std computed over 5 trials) (§7). Numbers with ∗ use the Detectron [28] default training schedule
  • Table9: AlexNet top-1 center-crop accuracy for linear classification on ImageNet-1k. Numbers for [52, 79] are from [80]. Numbers with use a different fine-tuning schedule
  • Table10: AlexNet architecture used for Jigsaw pretext task. X spatial resolution of layer, C number of channels in layer; K conv or pool kernel size; S computation stride; D kernel dilation; P padding; G group convolution, last layer is removed during transfer evaluation. Number with * depends on the size per permutation set used to train jigsaw puzzle
  • Table11: AlexNet architecture used for Colorization pretext task. X spatial resolution of layer, C number of channels in layer; K conv or pool kernel size; S computation stride; D kernel dilation; P padding; G group convolution, last layer is removed during transfer evaluation. Number with * depends on the colorization bin size
  • Table12: ResNet-50 architecture used for Jigsaw pretext task. X spatial resolution of layer, C number of channels in layer; K conv or pool kernel size; S computation stride; D kernel dilation; P padding; G group convolution. Layers denoted with res prefix represent the bottleneck residual block. Number with * use the original setting as in [31]. Layer with † is implemented as a conv layer. Number with depend on the size of permutation set used for training Jigsaw model (see Section 4.3 in main paper)
  • Table13: ResNet-50 architecture used for Colorization pretext task. X spatial resolution of layer, C number of channels in layer; K conv or pool kernel size; S computation stride; D kernel dilation; P padding; G group convolution. Layers denoted with res prefix represent the bottleneck residual block. Number with * use the original setting as in [31]. Layer with † is implemented as a conv layer
  • Table14: AlexNet architecture used for Colorization finetuning. X spatial resolution of layer, C number of channels in layer; K conv or pool kernel size; S computation stride; D kernel dilation; P padding; G group convolution, last layer is removed during transfer evaluation. Number with * depends on the colorization bin size. For evaluation, we downsample conv layers so that the resulting feature map has dimension 9k. Xd downsampled spatial resolution; Kd kernel size of downsample avgpool layer; Sd stride of downsample avgpool layer; Pd padding of downsample using avgpool layer
  • Table15: AlexNet architecture used for Jigsaw finetuning. X spatial resolution of layer, C number of channels in layer; K conv or pool kernel size; S computation stride; D kernel dilation; P padding; G group convolution, last layer is removed during transfer evaluation. Number with * depends on the colorization bin size. For evaluation, we downsample conv layers so that the resulting feature map has dimension 9k. Xd downsampled spatial resolution; Kd kernel size of downsample avgpool layer; Sd stride of downsample avgpool layer; Pd padding of downsample avgpool layer
  • Table16: ResNet-50 architecture used for Jigsaw Transfer task. X spatial resolution of layer, C number of channels in layer; K conv or pool kernel size; S computation stride; D kernel dilation; P padding; G group convolution. Layers denoted with res prefix represent the bottleneck residual block. Number with * use the original setting as in [31]. Layer with † depend on the number of output classes. For evaluation, we downsample conv layers so that the resulting feature map has dimension 9k. Xd downsampled spatial resolution; Kd kernel size of downsample avgpool layer; Sd stride of downsample avgpool layer; Pd padding of downsample using avgpool layer
  • Table17: ResNet-50 architecture used for Colorization Transfer task. X spatial resolution of layer, C number of channels in layer; K conv or pool kernel size; S computation stride; D kernel dilation; P padding; G group convolution. Layers denoted with res prefix represent the bottleneck residual block. Number with * use the original setting as in [31]. Layer with † depend on the number of output classes. For evaluation, we downsample conv layers so that the resulting feature map has dimension 9k. Xd downsampled spatial resolution; Kd kernel size of downsample avgpool layer; Sd stride of downsample avgpool layer; Pd padding of downsample using avgpool layer
  • Table18: ResNet-50 top-1 center-crop accuracy for linear classification on the ImageNet-1k dataset. Numbers with † use 10-20× longer fine-tuning and are reported on the unofficial ImageNet-1k validation split. Numbers with use a different fine-tuning procedure. All other models follow the setup from Zhang et al [80]
  • Table19: AlexNet linear SVM classification on the VOC07 dataset
  • Table20: Linear SVM classification on the COCO2014 dataset
  • Table21: ResNet-50 Full fine-tuning image classification (mAP scores)
  • Table22: AlexNet Full fine-tuning image classification (mAP scores) for VOC07: We report 10-crop numbers as in [80]. Method with † uses a different fine-tuning schedule, uses weight re-scaling, ∗ we could not determine exact fine-tuning details. Numbers with ‡ taken from [80]. We note that drawing consistent comparisons with (and among) prior work is difficult because of differences in the fine-tuning procedure and thus present these results only for the sake of completeness
  • Table23: Surface Normal Estimation on the NYUv2 dataset. We train ResNet-50 from res5 onwards and freeze the conv body below
  • Table24: Detection mAP for frozen conv body on VOC07 and VOC07+12 using Faster R-CNN with ResNet-50-C4. We freeze the conv body for all models
  • Table25: Detection mAP with full fine-tuning on VOC07 and VOC07+12 using Faster R-CNN with ResNet-50-C4
  • Table26: Varying number of patches N for a ResNet-50 on Jigsaw. We increase the problem complexity of the Jigsaw method by increasing the number of patches from 9 (default in [52]) to 16. We keep the size of the permutation set fixed at |P| = 2000. We report the performance of training a linear SVM on the fixed features for the VOC07 image classification task. We do not see an improvement by increasing the number of patches
  • Table27: Varying number of colorbins |Q| for a ResNet-50 on Colorization. We increase the problem complexity for the Colorization method by increasing the number of colors (|Q|) the ConvNet must predict. We evaluate the feature representation by training linear classifiers on the fixed features. We report the top-1 center crop accuracy on the Places205 dataset. A minimal sketch of color-bin target construction is also given after the tables list
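As referenced in the Table5 entry, the linear SVM transfer evaluation (Tables 5, 19 and 20) trains one binary linear SVM per class on features taken from a frozen network and reports mAP. The sketch below is a minimal illustration using scikit-learn's LIBLINEAR-backed LinearSVC on placeholder arrays; the cost value, feature normalization, and data shapes are assumptions rather than the authors' exact pipeline.

```python
import numpy as np
from sklearn.svm import LinearSVC

# Placeholder stand-ins for features already extracted from a frozen ConvNet layer
# (N samples, D dims) and VOC-style multi-label targets (N samples, K classes in {0, 1}).
rng = np.random.default_rng(0)
train_feats = rng.standard_normal((1000, 2048)).astype(np.float32)
train_labels = rng.integers(0, 2, size=(1000, 20))

classifiers = []
for k in range(train_labels.shape[1]):
    # One binary SVM per class; C=1.0 is an illustrative choice of the LIBLINEAR cost.
    clf = LinearSVC(C=1.0, max_iter=2000)
    clf.fit(train_feats, train_labels[:, k])
    classifiers.append(clf)

# At test time, per-class decision_function scores are used to compute AP and then mAP.
scores = np.stack([clf.decision_function(train_feats) for clf in classifiers], axis=1)
```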
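For the problem-complexity axis in Table27, the Colorization pretext target is a per-pixel class over |Q| quantized colors. The sketch below shows one simple way to build such targets by binning the ab channels of a Lab image on a uniform grid; the grid quantization and bin count are assumptions (the original colorization work [79] uses in-gamut ab bins), included only to make the role of |Q| concrete.

```python
import numpy as np
from skimage import color

def ab_bin_targets(image_rgb, bins_per_axis=15):
    """Per-pixel colorization targets: quantize the ab channels onto a
    bins_per_axis x bins_per_axis grid, i.e. |Q| = bins_per_axis**2 classes."""
    lab = color.rgb2lab(image_rgb)
    ab = lab[..., 1:]                                    # (H, W, 2) ab channels
    edges = np.linspace(-110.0, 110.0, bins_per_axis + 1)
    idx = np.clip(np.digitize(ab, edges) - 1, 0, bins_per_axis - 1)
    return idx[..., 0] * bins_per_axis + idx[..., 1]     # (H, W) class ids in [0, |Q|)

rng = np.random.default_rng(0)
targets = ab_bin_targets(rng.integers(0, 255, size=(64, 64, 3), dtype=np.uint8))
```

Increasing bins_per_axis increases |Q| and hence the difficulty of the pretext prediction, which is the knob Table27 varies.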
Related work
  • Visual representation learning without supervision is an old and active area of research. It has two common modeling approaches: generative and discriminative. A generative approach tries to model the data distribution directly. This can be modeled as maximizing the probability of reconstructing the input [47, 55, 72] and optionally estimating latent variables [32, 63] or using adversarial training [17, 48]. Our work focuses on discriminative learning.

    One form of discriminative learning combines clustering with hand-crafted features to learn visual representations, e.g., from image patches [15, 67] or object discovery [62, 68]. We focus on discriminative approaches that learn representations directly from the visual input. A large portion of such approaches are grouped under the term ‘self-supervised’ learning [11], in which the key principle is to automatically generate ‘labels’ from the data. The label generation can either be domain agnostic [7, 9, 56, 77] or exploit structural properties of the domain, e.g., the spatial structure of images [14]. We explore the ‘pretext’ tasks [14] that exploit structural information of the visual data to learn representations. These approaches can broadly be divided into two types: methods that use multi-modal information, e.g., sound [57], and methods that use only the visual data (images, videos). Multi-modal information such as depth from a sensor [19], sound in a video [4, 5, 25, 57], or sensors on an autonomous vehicle [3, 33, 84] can be used to automatically learn visual representations without human supervision. One can also use the temporal structure in a video for self-supervised methods [23, 30, 45, 50, 51]. Videos can provide information about how objects move [58], the relation between viewpoints [74, 75], etc.
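To make the pretext-label idea concrete, here is a minimal sketch of Jigsaw-style label generation in the spirit of [52]: the image is cut into a 3x3 grid of patches, the patches are reordered by one of a fixed set of permutations, and the training label is the index of the permutation that was applied. The random permutation set and the patch handling below are simplifying assumptions ([52] selects maximally distinct permutations, and the paper varies the permutation-set size |P| as one axis of problem complexity).

```python
import numpy as np

def make_jigsaw_example(image, permutations, rng):
    """Turn an unlabeled image (H, W, 3) into a (shuffled_patches, label) pair.

    permutations: array of shape (|P|, 9); each row is an ordering of the 9 grid
    positions. The returned label is the index of the row that was applied."""
    h, w = image.shape[0] // 3, image.shape[1] // 3
    patches = [image[i * h:(i + 1) * h, j * w:(j + 1) * w]
               for i in range(3) for j in range(3)]
    label = rng.integers(len(permutations))
    shuffled = [patches[p] for p in permutations[label]]
    return np.stack(shuffled), label

# Illustrative permutation set with |P| = 2000 random orderings of the 9 positions.
rng = np.random.default_rng(0)
perms = np.stack([rng.permutation(9) for _ in range(2000)])
patches, label = make_jigsaw_example(
    rng.integers(0, 255, size=(225, 225, 3)), perms, rng)
```

A network is then trained to predict `label` from the shuffled patches, so the supervision comes entirely from the image itself.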
References
  • CSAILVision Segmentation. https://github.com/CSAILVision/semantic-segmentation-pytorch. Accessed: 2019-03-20.
  • The Gelato Bet. https://people.eecs.berkeley.edu/~efros/gelato_bet.html. Accessed: 2019-03-20.
  • P. Agrawal, J. Carreira, and J. Malik. Learning to see by moving. In ICCV, 2015.
  • R. Arandjelovic and A. Zisserman. Look, listen and learn. In ICCV, 2017.
  • R. Arandjelovic and A. Zisserman. Objects that sound. In ECCV, 2018.
  • A. Bansal, B. Russell, and A. Gupta. Marr revisited: 2d-3d alignment via surface normal prediction. In CVPR, pages 5965–5974, 2016.
  • P. Bojanowski and A. Joulin. Unsupervised learning by predicting noise. In ICML, 2017.
  • B. E. Boser, I. M. Guyon, and V. N. Vapnik. A training algorithm for optimal margin classifiers. In Proceedings of the fifth annual workshop on Computational learning theory, pages 144–152. ACM, 1992.
  • M. Caron, P. Bojanowski, A. Joulin, and M. Douze. Deep clustering for unsupervised learning of visual features. In ECCV, 2018.
  • K. Cho, B. Van Merrienboer, C. Gulcehre, D. Bahdanau, F. Bougares, H. Schwenk, and Y. Bengio. Learning phrase representations using rnn encoder-decoder for statistical machine translation. arXiv preprint arXiv:1406.1078, 2014.
  • V. R. de Sa. Learning classification with unlabeled data. In NIPS, 1994.
  • J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei. Imagenet: A large-scale hierarchical image database. In CVPR, 2009.
  • A. Deshpande, J. Rock, and D. Forsyth. Learning large-scale automatic image colorization. In ICCV, 2015.
  • C. Doersch, A. Gupta, and A. A. Efros. Unsupervised visual representation learning by context prediction. In ICCV, pages 1422–1430, 2015.
  • C. Doersch, S. Singh, A. Gupta, J. Sivic, and A. Efros. What makes paris look like paris? ACM Transactions on Graphics, 31(4), 2012.
  • C. Doersch and A. Zisserman. Multi-task self-supervised visual learning. In ICCV, 2017.
  • J. Donahue, P. Krahenbuhl, and T. Darrell. Adversarial feature learning. arXiv preprint arXiv:1605.09782, 2016.
  • A. Dosovitskiy, P. Fischer, J. T. Springenberg, M. Riedmiller, and T. Brox. Discriminative unsupervised feature learning with exemplar convolutional neural networks. TPAMI, 38(9):1734–1747, 2016.
  • D. Eigen and R. Fergus. Predicting depth, surface normals and semantic labels with a common multi-scale convolutional architecture. In ICCV, 2015.
  • M. Everingham, S. M. A. Eslami, L. Van Gool, C. K. I. Williams, J. Winn, and A. Zisserman. The pascal visual object classes challenge: A retrospective. International Journal of Computer Vision, 111(1):98–136, Jan. 2015.
  • M. Everingham, L. Van Gool, C. K. Williams, J. Winn, and A. Zisserman. The pascal visual object classes (voc) challenge. IJCV, 88(2), 2010.
  • R.-E. Fan, K.-W. Chang, C.-J. Hsieh, X.-R. Wang, and C.-J. Lin. LIBLINEAR: A library for large linear classification. JMLR, 9:1871–1874, 2008.
  • B. Fernando, H. Bilen, E. Gavves, and S. Gould. Selfsupervised video representation learning with odd-one-out networks. In CVPR, 2017.
  • D. F. Fouhey, A. Gupta, and M. Hebert. Data-driven 3d primitives for single image understanding. In ICCV, 2013.
  • R. Gao, R. Feris, and K. Grauman. Learning to separate object sounds by watching unlabeled video. In ECCV, 2018.
  • S. Gidaris, P. Singh, and N. Komodakis. Unsupervised representation learning by predicting image rotations. arXiv preprint arXiv:1803.07728, 2018.
  • R. Girshick. Fast r-cnn. In ICCV, 2015.
  • R. Girshick, I. Radosavovic, G. Gkioxari, P. Dollar, and K. He. Detectron, 2018.
  • P. Goyal, P. Dollar, R. Girshick, P. Noordhuis, L. Wesolowski, A. Kyrola, A. Tulloch, Y. Jia, and K. He. Accurate, large minibatch sgd: Training imagenet in 1 hour. arXiv preprint arXiv:1706.02677, 2017.
  • R. Hadsell, S. Chopra, and Y. LeCun. Dimensionality reduction by learning an invariant mapping. In CVPR, 2006.
  • K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In CVPR, 2016.
  • F. J. Huang, Y.-L. Boureau, Y. LeCun, et al. Unsupervised learning of invariant feature hierarchies with applications to object recognition. In CVPR, 2007.
  • D. Jayaraman and K. Grauman. Learning image representations tied to ego-motion. In ICCV, 2015.
  • Y. Jia, E. Shelhamer, J. Donahue, S. Karayev, J. Long, R. Girshick, S. Guadarrama, and T. Darrell. Caffe: Convolutional architecture for fast feature embedding. arXiv preprint arXiv:1408.5093, 2014.
  • A. Joulin, L. van der Maaten, A. Jabri, and N. Vasilache. Learning visual features from large weakly supervised data. In ECCV, 2016.
  • D. Kingma and J. Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
  • A. Kolesnikov, X. Zhai, and L. Beyer. Revisiting selfsupervised visual representation learning. arXiv preprint arXiv:1901.09005, 2019.
  • P. Krahenbuhl, C. Doersch, J. Donahue, and T. Darrell. Datadependent initializations of convolutional neural networks. arXiv preprint arXiv:1511.06856, 2015.
  • A. Krizhevsky, I. Sutskever, and G. E. Hinton. Imagenet classification with deep convolutional neural networks. In NIPS, 2012.
  • L. Ladicky, B. Zeisl, and M. Pollefeys. Discriminatively trained dense surface normal estimation. In ECCV, 2014.
  • G. Larsson, M. Maire, and G. Shakhnarovich. Learning representations for automatic colorization. In ECCV, 2016.
  • G. Larsson, M. Maire, and G. Shakhnarovich. Colorization as a proxy task for visual understanding. In CVPR, 2017.
  • Y. LeCun, B. Boser, J. S. Denker, D. Henderson, R. E. Howard, W. Hubbard, and L. D. Jackel. Backpropagation applied to handwritten zip code recognition. Neural computation, 1, 1989.
  • T.-Y. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollar, and C. L. Zitnick. Microsoft coco: Common objects in context. In ECCV. 2014.
  • P. Luc, N. Neverova, C. Couprie, J. Verbeek, and Y. LeCun. Predicting deeper into the future of semantic segmentation. In ICCV, 2017.
  • D. Mahajan, R. Girshick, V. Ramanathan, K. He, M. Paluri, Y. Li, A. Bharambe, and L. van der Maaten. Exploring the limits of weakly supervised pretraining. In ECCV, 2018.
  • J. Masci, U. Meier, D. Ciresan, and J. Schmidhuber. Stacked convolutional auto-encoders for hierarchical feature extraction. In International Conference on Artificial Neural Networks, pages 52–59.
  • L. Mescheder, S. Nowozin, and A. Geiger. Adversarial variational bayes: Unifying variational autoencoders and generative adversarial networks. In ICML, 2017.
  • I. Misra, A. Shrivastava, A. Gupta, and M. Hebert. Crossstitch networks for multi-task learning. In CVPR, 2016.
  • I. Misra, C. L. Zitnick, and M. Hebert. Shuffle and learn: unsupervised learning using temporal order verification. In ECCV, 2016.
  • H. Mobahi, R. Collobert, and J. Weston. Deep learning from temporal coherence in video. In ICML, 2009.
  • M. Noroozi and P. Favaro. Unsupervised learning of visual representations by solving jigsaw puzzles. In ECCV, 2016.
  • M. Noroozi, H. Pirsiavash, and P. Favaro. Representation learning by learning to count. In ICCV, 2017.
  • M. Noroozi, A. Vinjimoor, P. Favaro, and H. Pirsiavash. Boosting self-supervised learning via knowledge transfer. In CVPR, 2018.
  • B. A. Olshausen and D. J. Field. Emergence of simple-cell receptive field properties by learning a sparse code for natural images. Nature, 381(6583):607, 1996.
  • A. v. d. Oord, Y. Li, and O. Vinyals. Representation learning with contrastive predictive coding. arXiv preprint arXiv:1807.03748, 2018.
  • A. Owens, J. Wu, J. H. McDermott, W. T. Freeman, and A. Torralba. Ambient sound provides supervision for visual learning. In ECCV, 2016.
  • D. Pathak, R. Girshick, P. Dollar, T. Darrell, and B. Hariharan. Learning features by watching objects move. In CVPR, 2017.
  • C. Peng, T. Xiao, Z. Li, Y. Jiang, X. Zhang, K. Jia, G. Yu, and J. Sun. Megdet: A large mini-batch object detector. In CVPR, 2018.
  • S. Ren, K. He, R. Girshick, and J. Sun. Faster r-cnn: Towards real-time object detection with region proposal networks. In NIPS, 2015.
  • O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein, A. C. Berg, and L. Fei-Fei. ImageNet Large Scale Visual Recognition Challenge. IJCV, 115, 2015.
  • B. C. Russell, W. T. Freeman, A. A. Efros, J. Sivic, and A. Zisserman. Using multiple segmentations to discover objects and their extent in image collections. In CVPR, 2006.
  • R. Salakhutdinov and G. Hinton. Deep boltzmann machines. In Artificial intelligence and statistics, pages 448–455, 2009.
  • A. Sax, B. Emi, A. R. Zamir, L. Guibas, S. Savarese, and J. Malik. Mid-level visual representations improve generalization and sample efficiency for learning active tasks. arXiv preprint arXiv:1812.11971, 2018.
  • J. Schulman, F. Wolski, P. Dhariwal, A. Radford, and O. Klimov. Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347, 2017.
  • N. Silberman, D. Hoiem, P. Kohli, and R. Fergus. Indoor segmentation and support inference from rgbd images. In ECCV. Springer, 2012.
  • S. Singh, A. Gupta, and A. A. Efros. Unsupervised discovery of mid-level discriminative patches. In ECCV. 2012.
  • J. Sivic, B. C. Russell, A. A. Efros, A. Zisserman, and W. T. Freeman. Discovering objects and their location in images. In ICCV, 2005.
  • C. Sun, A. Shrivastava, S. Singh, and A. Gupta. Revisiting unreasonable effectiveness of data in deep learning era. In ICCV, 2017.
  • B. Thomee, D. A. Shamma, G. Friedland, B. Elizalde, K. Ni, D. Poland, D. Borth, and L.-J. Li. Yfcc100m: The new data in multimedia research. arXiv preprint arXiv:1503.01817, 2015.
  • J. R. Uijlings, K. E. Van De Sande, T. Gevers, and A. W. Smeulders. Selective search for object recognition. IJCV, 104(2):154–171, 2013.
  • P. Vincent, H. Larochelle, Y. Bengio, and P.-A. Manzagol. Extracting and composing robust features with denoising autoencoders. In ICML. ACM, 2008.
  • X. Wang, D. Fouhey, and A. Gupta. Designing deep networks for surface normal estimation. In CVPR, 2015.
  • X. Wang and A. Gupta. Unsupervised learning of visual representations using videos. In ICCV, 2015.
  • X. Wang, K. He, and A. Gupta. Transitive invariance for selfsupervised visual representation learning. In ICCV, pages 1329–1338, 2017.
  • Y.-X. Wang and M. Hebert. Learning to learn: Model regression networks for easy small sample learning. In ECCV, 2016.
  • Z. Wu, Y. Xiong, S. X. Yu, and D. Lin. Unsupervised feature learning via non-parametric instance discrimination. In CVPR, 2018.
  • F. Xia, A. R. Zamir, Z. He, A. Sax, J. Malik, and S. Savarese. Gibson env: Real-world perception for embodied agents. In CVPR, 2018.
  • R. Zhang, P. Isola, and A. A. Efros. Colorful image colorization. In ECCV, 2016.
  • R. Zhang, P. Isola, and A. A. Efros. Split-brain autoencoders: Unsupervised learning by cross-channel prediction. In CVPR, 2017.
  • H. Zhao, J. Shi, X. Qi, X. Wang, and J. Jia. Pyramid scene parsing network. In CVPR, 2017.
  • B. Zhou, A. Lapedriza, J. Xiao, A. Torralba, and A. Oliva. Learning deep features for scene recognition using places database. In NIPS, 2014.
  • B. Zhou, H. Zhao, X. Puig, T. Xiao, S. Fidler, A. Barriuso, and A. Torralba. Semantic understanding of scenes through the ade20k dataset. IJCV, 2018.
  • T. Zhou, M. Brown, N. Snavely, and D. G. Lowe. Unsupervised learning of depth and ego-motion from video. In CVPR, 2017.
  • 1. AlexNet Colorization: We strictly follow [79]. Specifically, we train on 8 GPUs and use a minibatch size of …
  • 2. AlexNet Jigsaw: We follow the settings above and train on 8 GPUs with a minibatch size of 256 and an initial learning rate (LR) of 0.01, dropped by a factor of 10 at fixed intervals. We use momentum of 0.9, weight decay of 5e-4, and SpatialBN weight decay of 0. We apply weight decay to the model's bias parameters but not to the scale and bias parameters of SpatialBN layers. We train for 140k iterations in total with an LR schedule of 40k/40k/40k/20k. We read the input image, convert it to Lab space, resize the shorter side to 256, randomly crop a 227x227 image, and apply a horizontal flip with 50% probability. A minimal sketch of this optimizer configuration is given below.
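Read literally, the Jigsaw pre-training settings above correspond to an SGD configuration like the sketch below. PyTorch is used only for illustration, the stand-in model is not the actual Jigsaw AlexNet, and the BatchNorm-based parameter split is an approximation of the "no weight decay on SpatialBN scale/bias" rule; treat this as an assumption-laden sketch rather than the authors' recipe.

```python
import torch
import torch.nn as nn

# Stand-in for the Jigsaw AlexNet: any network ending in a |P|-way classifier (here |P| = 2000).
model = nn.Sequential(
    nn.Conv2d(3, 96, kernel_size=11, stride=4), nn.BatchNorm2d(96), nn.ReLU(),
    nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(96, 2000))

# SGD with the listed hyper-parameters: LR 0.01, momentum 0.9, weight decay 5e-4.
# BatchNorm (SpatialBN) scale/bias parameters are placed in a zero-decay group.
decay, no_decay = [], []
for module in model.modules():
    for p in module.parameters(recurse=False):
        (no_decay if isinstance(module, nn.BatchNorm2d) else decay).append(p)
optimizer = torch.optim.SGD(
    [{"params": decay, "weight_decay": 5e-4},
     {"params": no_decay, "weight_decay": 0.0}],
    lr=0.01, momentum=0.9)

# 140k iterations with the LR dropped by 10x on the 40k/40k/40k/20k schedule.
scheduler = torch.optim.lr_scheduler.MultiStepLR(
    optimizer, milestones=[40_000, 80_000, 120_000], gamma=0.1)

# Training-loop sketch (one 256-image minibatch of shuffled patches per step):
# for step in range(140_000):
#     loss = criterion(model(batch), permutation_labels)
#     loss.backward(); optimizer.step(); optimizer.zero_grad(); scheduler.step()
```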
  • 1. AlexNet Jigsaw: We strictly follow [79]. Specifically, we train for 80k iterations on 1 GPU with a minibatch size of 16 and an initial learning rate (LR) of 0.001, dropped by a factor of 10 after 10k iterations. We use momentum of 0.9, weight decay of 1e-6, and SpatialBN weight decay of 1e-4. We do not apply weight decay to the model's bias parameters, nor to the scale and bias parameters of SpatialBN layers. We read the input image, randomly crop a 227x227 image, and apply a horizontal flip with 50% probability.
  • 2. AlexNet Colorization: We follow the settings above and train for 80k iterations on 1 GPU with a minibatch size of 16 and an initial learning rate (LR) of 0.005, dropped by a factor of 10 after 10k iterations. We use momentum of 0.9, weight decay of 1e-6, and SpatialBN weight decay of 0. We do not apply weight decay to the model's bias parameters. We read the input image, convert it to Lab, randomly crop a 227x227 image, and apply a horizontal flip with 50% probability. A minimal sketch of this Lab preprocessing is given below.
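The Lab-space input pipeline mentioned for the Colorization models (convert RGB to Lab, take a random 227x227 crop, flip horizontally with 50% probability) could look like the sketch below; scikit-image for the color conversion and NumPy for the crop/flip are illustrative choices, not the authors' implementation.

```python
import numpy as np
from skimage import color

def colorization_transform(image_rgb, rng):
    """image_rgb: uint8 array (H, W, 3) with H, W >= 227. Returns a 227x227 Lab crop."""
    lab = color.rgb2lab(image_rgb)                  # RGB -> Lab conversion
    h, w, _ = lab.shape
    top = rng.integers(0, h - 227 + 1)              # random 227x227 crop
    left = rng.integers(0, w - 227 + 1)
    crop = lab[top:top + 227, left:left + 227]
    if rng.random() < 0.5:                          # horizontal flip with 50% probability
        crop = crop[:, ::-1]
    return crop

rng = np.random.default_rng(0)
example = colorization_transform(
    rng.integers(0, 255, size=(256, 320, 3), dtype=np.uint8), rng)
```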
  • Methods compared in the AlexNet transfer tables: SplitBrain, Counting, Rotation, DeepCluster [9, 14, 26, 53, 80], and AlexNet Jigsaw / Colorization models pre-trained on ImageNet-1k, ImageNet-22k, and YFCC-100M.