Self-supervised Visual Feature Learning with Deep Neural Networks: A Survey

arXiv: Computer Vision and Pattern Recognition, 2019.

Keywords:
semantic segmentation, computer vision application, learning method, general visual feature, Structural Similarity Index
This paper provides an extensive review of deep learning-based self-supervised general visual feature learning methods from images or videos.

Abstract:

Large-scale labeled data are generally required to train deep neural networks in order to obtain better performance in visual feature learning from images or videos for computer vision applications. To avoid the extensive cost of collecting and annotating large-scale datasets, as a subset of unsupervised learning methods, self-supervised learning …

Introduction
  • 1.1 Motivation

    Due to the powerful ability to learn different levels of general visual features, deep neural networks have been used as the basic structure for many computer vision applications such as object detection [1], [2], [3], semantic segmentation [4], [5], [6], image captioning [7], etc.
  • Various networks, including AlexNet [8], VGG [9], GoogLeNet [10], ResNet [11], and DenseNet [12], and large-scale datasets such as ImageNet [13] and OpenImage [14] have been proposed to train very deep ConvNets.
  • With these sophisticated architectures and large-scale datasets, the performance of ConvNets keeps breaking the state of the art for many computer vision tasks [1], [4], [7], [15], [16].
  • Representative self-supervised video feature learning methods include AVTS [25], AudioVisual [26], LookListenLearn [93], AmbientSound [154], EgoMotion [155], LearnByMove [94], and TiedEgoMotion [95].
Highlights
  • Due to the powerful ability to learn different levels of general visual features, deep neural networks have been used as the basic structure for many computer vision applications such as object detection [1], [2], [3], semantic segmentation [4], [5], [6], image captioning [7], etc.
  • The models trained on large-scale image datasets like ImageNet are widely used as pre-trained models and fine-tuned for other tasks for two main reasons: (1) the parameters learned from large-scale diverse datasets provide a good starting point, so networks trained on other tasks converge faster; (2) a network trained on a large-scale dataset has already learned hierarchical features, which helps reduce over-fitting during the training of other tasks, especially when the datasets of those tasks are small or training labels are scarce (a minimal fine-tuning sketch follows this list)
  • This review only focuses on self-supervised learning methods for visual feature learning with ConvNets, in which the learned features can be transferred to multiple different computer vision tasks
  • Self-supervised image feature learning with deep convolutional neural networks has achieved great success, and the gap between the performance of self-supervised methods and that of supervised methods on some downstream tasks has become very small
  • With a supervised pre-trained model on the large-scale Kinetics dataset (500,000 videos of 600 classes) with human-annotated class labels and fine-tuned on the UCF101 dataset, the performance can increase to 84%
  • This paper has extensively reviewed recent deep convolutional neural network-based methods for self-supervised image and video feature learning from all perspectives, including common network architectures, pretext tasks, algorithms, datasets, performance comparisons, discussions, and future directions
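    A minimal PyTorch sketch of the pre-train-then-fine-tune recipe described above. It assumes torchvision's ImageNet-supervised AlexNet as the pre-trained backbone (a self-supervised checkpoint could be loaded instead); the downstream class count and the data tensors are hypothetical placeholders, not part of the survey.

```python
import torch
import torch.nn as nn
from torchvision import models

num_classes = 20                                     # placeholder downstream task
model = models.alexnet(pretrained=True)              # pre-trained parameters as the starting point
                                                     # (older torchvision API; newer versions use weights=...)
model.classifier[6] = nn.Linear(4096, num_classes)   # replace the 1000-way ImageNet head

# Optionally freeze the convolutional features so only the new head is trained;
# leaving them trainable (with a small learning rate) corresponds to full fine-tuning.
for p in model.features.parameters():
    p.requires_grad = False

optimizer = torch.optim.SGD(
    [p for p in model.parameters() if p.requires_grad], lr=1e-3, momentum=0.9)
criterion = nn.CrossEntropyLoss()

def train_step(images, labels):
    """One supervised step on the downstream task."""
    optimizer.zero_grad()
    loss = criterion(model(images), labels)
    loss.backward()
    optimizer.step()
    return loss.item()
```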
Methods
  • Self-supervised image feature learning methods summarized in Table 2, grouped by pretext-task category (generation-based, context-based, free semantic label-based, or multiple): GAN [83], DCGAN [120], WGAN [121], BiGAN [122], SelfGAN [123], ColorfulColorization [18], Colorization [82], AutoColor [124], Split-Brain [42], Context Encoder [19], CompletNet [125], SRGAN [15], SpotArtifacts [126]; ImproveContext [33], Context Prediction [41]; Jigsaw [20], Damaged Jigsaw [89], Arbitrary Jigsaw [88], DeepPermNet [127]; RotNet [36], Boosting [34], JointCluster [128], DeepCluster [44], ClusterEmbedding [129], GraphConstraint [43], Ranking [38], PredictNoise [46], MultiTask [32], Learning2Count [130]; Watching Move [81], Edge Detection [81], Cross Domain [81]. A rotation-prediction sketch is given after this list.
  • Methods compared by linear classification on the ImageNet and Places datasets (Table 4): Places labels [8], ImageNet labels [8], Random (Scratch) [8], ColorfulColorization [18], BiGAN [122], SplitBrain [42], ContextEncoder [19], ContextPrediction [41], Jigsaw [20], Learning2Count [130], DeepClustering [44].
  • Methods compared on PASCAL VOC classification, detection, and segmentation (Table 5): ImageNet labels [8], Random (Scratch) [8], ContextEncoder [19], BiGAN [122], ColorfulColorization [18], SplitBrain [42], RankVideo [38], PredictNoise [46], JigsawPuzzle [20], ContextPrediction [41], Learning2Count [130], DeepClustering [44], WatchingVideo [81], CrossDomain [30], AmbientSound [154], TiedToEgoMotion [95], EgoMotion [94].
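    As referenced in the list above, a minimal sketch of one context-based pretext task, rotation prediction (RotNet [36]): every image is rotated by 0, 90, 180, or 270 degrees, and the network is trained to predict which rotation was applied, so the pseudo-labels come for free from the transformation itself. The 4-way AlexNet head and the batch construction below are illustrative assumptions, not the authors' exact implementation.

```python
import torch
import torch.nn as nn
from torchvision import models

def make_rotation_batch(images):
    """images: (B, C, H, W). Returns rotated copies and their rotation pseudo-labels."""
    rotated, labels = [], []
    for k in range(4):                                  # 0, 90, 180, 270 degrees
        rotated.append(torch.rot90(images, k, dims=(2, 3)))
        labels.append(torch.full((images.size(0),), k, dtype=torch.long))
    return torch.cat(rotated), torch.cat(labels)

model = models.alexnet(num_classes=4)                   # backbone + 4-way rotation classifier
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.01, momentum=0.9)

def pretext_step(images):
    """One self-supervised training step; no human annotation is used."""
    x, y = make_rotation_batch(images)
    optimizer.zero_grad()
    loss = criterion(model(x), y)
    loss.backward()
    optimizer.step()
    return loss.item()
```

    After pretext training, the convolutional layers of the model would be kept and transferred to downstream tasks, as in the fine-tuning sketch above.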
Results
  • Evaluation Metrics

    More evaluation metrics are needed to evaluate the quality of the learned features at different levels.
  • The current solution is to use the performance on downstream tasks to indicate the quality of the learned features (a linear-evaluation sketch is given after this list)
  • This evaluation metric does not give insight into what the network actually learns through self-supervised pre-training.
  • More evaluation metrics, such as network dissection [78], should be employed to analyze the interpretability of the self-supervised learned features.
  • Some future directions of self-supervised learning are discussed
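    A minimal sketch of the linear-evaluation protocol referred to above (and used in Table 4): the self-supervised backbone is frozen and only a linear classifier is trained on activations from one of its convolutional layers. The checkpoint path, pooling size, and class count are hypothetical placeholders.

```python
import torch
import torch.nn as nn
from torchvision import models

backbone = models.alexnet()
# backbone.load_state_dict(torch.load("selfsup_checkpoint.pth"))  # hypothetical checkpoint
backbone.eval()
for p in backbone.parameters():
    p.requires_grad = False                       # the learned features stay fixed

conv_stack = backbone.features                    # conv1-conv5 of AlexNet
pool = nn.AdaptiveAvgPool2d((2, 2))               # pool conv5 activations to a fixed size
feat_dim = 256 * 2 * 2

linear_probe = nn.Linear(feat_dim, 1000)          # e.g. 1000 ImageNet classes
optimizer = torch.optim.SGD(linear_probe.parameters(), lr=0.01, momentum=0.9)
criterion = nn.CrossEntropyLoss()

def probe_step(images, labels):
    """Train only the linear classifier; this evaluates, rather than changes, the features."""
    with torch.no_grad():
        feats = pool(conv_stack(images)).flatten(1)
    optimizer.zero_grad()
    loss = criterion(linear_probe(feats), labels)
    loss.backward()
    optimizer.step()
    return loss.item()
```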
Conclusion
  • Self-supervised image feature learning with deep convolutional neural networks has achieved great success, and the gap between the performance of self-supervised methods and that of supervised methods on some downstream tasks has become very small.
  • This paper has extensively reviewed recent deep convolutional neural network-based methods for self-supervised image and video feature learning from all perspectives, including common network architectures, pretext tasks, algorithms, datasets, performance comparisons, discussions, and future directions.
  • The comparative summary of the methods, datasets, and performance in tabular form clearly demonstrates their properties, which will benefit researchers in the computer vision community
Tables
  • Table1: Summary of commonly used image and video datasets. Note that image datasets can be used to learn image features, while video datasets can be used to learn both image and video features
  • Table2: Summary of self-supervised image feature learning methods based on the category of pretext tasks. Multi-task means the method explicitly or implicitly uses multiple pretext tasks for image feature learning
  • Table3: Summary of self-supervised video feature learning methods based on the category of pretext tasks
  • Table4: Linear classification on the ImageNet and Places datasets using activations from the convolutional layers of an AlexNet as features. "Convn" means the linear classifier is trained on features from the n-th convolutional layer of AlexNet. "Places Labels" and "ImageNet Labels" indicate using a supervised model trained with human-annotated labels as the pre-trained model
  • Table5: Comparison of the self-supervised image feature learning methods on classification, detection, and segmentation on PASCAL VOC dataset
  • Table6: Comparison of the existing self-supervised methods for action recognition on the UCF101 and HMDB51 datasets. * indicates the average accuracy over three splits. "Kinetics Labels" indicates using a supervised model trained with human-annotated labels as the pre-trained model. A 3D-ConvNet fine-tuning sketch is given after this list
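    The video counterpart of this evaluation (Table 6) fine-tunes a pre-trained spatiotemporal backbone for action recognition. A minimal sketch, assuming torchvision's Kinetics-supervised R3D-18 stands in for whichever pre-trained 3D ConvNet is being evaluated (the surveyed methods would load their own self-supervised checkpoints); UCF101's 101 classes are the downstream target.

```python
import torch
import torch.nn as nn
from torchvision.models.video import r3d_18

model = r3d_18(pretrained=True)                    # Kinetics-supervised weights (placeholder backbone;
                                                   # older torchvision API, newer versions use weights=...)
model.fc = nn.Linear(model.fc.in_features, 101)    # new head for the 101 UCF101 action classes

optimizer = torch.optim.SGD(model.parameters(), lr=1e-3, momentum=0.9)
criterion = nn.CrossEntropyLoss()

def finetune_step(clips, labels):
    """clips: (B, 3, T, H, W) video tensor; labels: (B,) action classes."""
    optimizer.zero_grad()
    loss = criterion(model(clips), labels)
    loss.backward()
    optimizer.step()
    return loss.item()
```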
Funding
  • This material is based upon work supported by the National Science Foundation under award number IIS-1400802
Reference
  • R. Girshick, J. Donahue, T. Darrell, and J. Malik, “Rich feature hierarchies for accurate object detection and semantic segmentation,” in CVPR, pp. 580–587, 2014.
  • R. Girshick, “Fast R-CNN,” in ICCV, 2015.
  • S. Ren, K. He, R. Girshick, and J. Sun, “Faster r-cnn: Towards realtime object detection with region proposal networks,” in NIPS, pp. 91–99, 2015.
  • J. Long, E. Shelhamer, and T. Darrell, “Fully convolutional networks for semantic segmentation,” in CVPR, pp. 3431–3440, 2015.
  • L.-C. Chen, G. Papandreou, I. Kokkinos, K. Murphy, and A. L. Yuille, “Deeplab: Semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected crfs,” TPAMI, 2018.
  • H. Zhao, J. Shi, X. Qi, X. Wang, and J. Jia, “Pyramid scene parsing network,” in CVPR, pp. 2881–2890, 2017.
  • O. Vinyals, A. Toshev, S. Bengio, and D. Erhan, “Show and tell: A neural image caption generator,” in CVPR, pp. 3156–3164, 2015.
  • A. Krizhevsky, I. Sutskever, and G. E. Hinton, “Imagenet classification with deep convolutional neural networks,” in NIPS, pp. 1097–1105, 2012.
  • K. Simonyan and A. Zisserman, “Very deep convolutional networks for large-scale image recognition,” ICLR, 2015.
  • C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and A. Rabinovich, “Going deeper with convolutions,” CVPR, 2015.
  • K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” in CVPR, pp. 770–778, 2016.
  • G. Huang, Z. Liu, K. Q. Weinberger, and L. van der Maaten, “Densely connected convolutional networks,” in CVPR, vol. 1, p. 3, 2017.
  • J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei, “Imagenet: A large-scale hierarchical image database,” in CVPR, pp. 248–255, IEEE, 2009.
  • A. Kuznetsova, H. Rom, N. Alldrin, J. Uijlings, I. Krasin, J. Pont-Tuset, S. Kamali, S. Popov, M. Malloci, and T. Duerig, “The open images dataset v4: Unified image classification, object detection, and visual relationship detection at scale,” arXiv preprint arXiv:1811.00982, 2018.
  • C. Ledig, L. Theis, F. Huszar, J. Caballero, A. Cunningham, A. Acosta, A. P. Aitken, A. Tejani, J. Totz, Z. Wang, and W. Shi, “Photo-realistic single image super-resolution using a generative adversarial network,” in CVPR.
  • D. Tran, L. Bourdev, R. Fergus, L. Torresani, and M. Paluri, “Learning spatiotemporal features with 3D convolutional networks,” in ICCV, 2015.
  • W. Kay, J. Carreira, K. Simonyan, B. Zhang, C. Hillier, S. Vijayanarasimhan, F. Viola, T. Green, T. Back, P. Natsev, et al., “The kinetics human action video dataset,” arXiv preprint arXiv:1705.06950, 2017.
  • R. Zhang, P. Isola, and A. A. Efros, “Colorful image colorization,” in ECCV, pp. 649–666, Springer, 2016.
  • D. Pathak, P. Krahenbuhl, J. Donahue, T. Darrell, and A. A. Efros, “Context encoders: Feature learning by inpainting,” in CVPR, pp. 2536–2544, 2016.
  • M. Noroozi and P. Favaro, “Unsupervised learning of visual representations by solving jigsaw puzzles,” in ECCV, 2016.
  • D. Mahajan, R. B. Girshick, V. Ramanathan, K. He, M. Paluri, Y. Li, A. Bharambe, and L. van der Maaten, “Exploring the limits of weakly supervised pretraining,” in ECCV, pp. 185–201, 2018.
  • W. Li, L. Wang, W. Li, E. Agustsson, and L. Van Gool, “Webvision database: Visual learning and understanding from web data,” arXiv preprint arXiv:1708.02862, 2017.
  • A. Mahendran, J. Thewlis, and A. Vedaldi, “Cross pixel optical flow similarity for self-supervised learning,” arXiv preprint arXiv:1807.05636, 2018.
  • N. Sayed, B. Brattoli, and B. Ommer, “Cross and learn: Crossmodal self-supervision,” arXiv preprint arXiv:1811.03879, 2018.
  • B. Korbar, D. Tran, and L. Torresani, “Cooperative learning of audio and video models from self-supervised synchronization,” in NIPS, pp. 7773–7784, 2018.
  • A. Owens and A. A. Efros, “Audio-visual scene analysis with self-supervised multisensory features,” arXiv preprint arXiv:1804.03641, 2018.
  • D. Kim, D. Cho, and I. S. Kweon, “Self-supervised video representation learning with space-time cubic puzzles,” arXiv preprint arXiv:1811.09795, 2018.
  • L. Jing and Y. Tian, “Self-supervised spatiotemporal feature learning by video geometric transformations,” arXiv preprint arXiv:1811.11387, 2018.
  • B. Fernando, H. Bilen, E. Gavves, and S. Gould, “Self-supervised video representation learning with odd-one-out networks,” in CVPR, 2017.
  • Z. Ren and Y. J. Lee, “Cross-domain self-supervised multi-task feature learning using synthetic imagery,” in CVPR, 2018.
  • X. Wang, K. He, and A. Gupta, “Transitive invariance for selfsupervised visual representation learning,” in ICCV, 2017.
  • C. Doersch and A. Zisserman, “Multi-task self-supervised visual learning,” in ICCV, 2017.
  • T. N. Mundhenk, D. Ho, and B. Y. Chen, “Improvements to context based self-supervised learning,” in CVPR, 2018.
  • M. Noroozi, A. Vinjimoor, P. Favaro, and H. Pirsiavash, “Boosting self-supervised learning via knowledge transfer,” arXiv preprint arXiv:1805.00385, 2018.
  • U. Buchler, B. Brattoli, and B. Ommer, “Improving spatiotemporal self-supervision by deep reinforcement learning,” in ECCV, pp. 770–786, 2018.
  • S. Gidaris, P. Singh, and N. Komodakis, “Unsupervised representation learning by predicting image rotations,” in ICLR, 2018.
  • N. Srivastava, E. Mansimov, and R. Salakhutdinov, “Unsupervised Learning of Video Representations using LSTMs,” in ICML, 2015.
  • X. Wang and A. Gupta, “Unsupervised learning of visual representations using videos,” in ICCV, 2015.
  • H.-Y. Lee, J.-B. Huang, M. Singh, and M.-H. Yang, “Unsupervised representation learning by sorting sequences,” in ICCV, pp. 667– 676, IEEE, 2017.
  • I. Misra, C. L. Zitnick, and M. Hebert, “Shuffle and learn: unsupervised learning using temporal order verification,” in ECCV, pp. 527–544, Springer, 2016.
  • C. Doersch, A. Gupta, and A. A. Efros, “Unsupervised visual representation learning by context prediction,” in ICCV, pp. 1422– 1430, 2015.
  • R. Zhang, P. Isola, and A. A. Efros, “Split-brain autoencoders: Unsupervised learning by cross-channel prediction,” in CVPR, 2017.
  • D. Li, W.-C. Hung, J.-B. Huang, S. Wang, N. Ahuja, and M.-H. Yang, “Unsupervised visual representation learning by graphbased consistent constraints,” in ECCV, 2016.
  • M. Caron, P. Bojanowski, A. Joulin, and M. Douze, “Deep clustering for unsupervised learning of visual features,” in ECCV, 2018.
  • E. Hoffer, I. Hubara, and N. Ailon, “Deep unsupervised learning through spatial contrasting,” arXiv preprint arXiv:1610.00243, 2016.
  • P. Bojanowski and A. Joulin, “Unsupervised learning by predicting noise,” arXiv preprint arXiv:1704.05310, 2017.
  • Y. Li, M. Paluri, J. M. Rehg, and P. Dollar, “Unsupervised learning of edges,” CVPR, pp. 1619–1627, 2016.
  • S. Purushwalkam and A. Gupta, “Pose from action: Unsupervised learning of pose features based on motion,” arXiv preprint arXiv:1609.05420, 2016.
  • J. Sanchez, F. Perronnin, T. Mensink, and J. Verbeek, “Image classification with the fisher vector: Theory and practice,” IJCV, vol. 105, no. 3, pp. 222–245, 2013.
  • A. Faktor and M. Irani, “Video segmentation by non-local consensus voting.,” in BMVC, vol. 2, 2014.
  • O. Stretcu and M. Leordeanu, “Multiple frames matching for object discovery in video.,” in BMVC, vol. 1, 2015.
  • C. Zhang, S. Bengio, M. Hardt, B. Recht, and O. Vinyals, “Understanding deep learning requires rethinking generalization,” arXiv preprint arXiv:1611.03530, 2016.
  • K. Simonyan and A. Zisserman, “Two-Stream Convolutional Networks for Action Recognition in Videos,” in NIPS, 2014.
  • J. Donahue, L. A. Hendricks, S. Guadarrama, M. Rohrbach, S. Venugopalan, K. Saenko, and T. Darrell, “Long-term recurrent convolutional networks for visual recognition and description,” in CVPR, 2015.
  • C. Feichtenhofer, A. Pinz, and R. Wildes, “Spatiotemporal residual networks for video action recognition,” in NIPS, pp. 3468– 3476, 2016.
  • C. Feichtenhofer, A. Pinz, and R. P. Wildes, “Spatiotemporal multiplier networks for video action recognition,” in CVPR.
  • C. Feichtenhofer, A. Pinz, and A. Zisserman, “Convolutional twostream network fusion for video action recognition,” in CVPR, pp. 1933–1941, 2016.
  • L. Wang, Y. Qiao, and X. Tang, “Action recognition with trajectory-pooled deep-convolutional descriptors,” in CVPR, pp. 4305–4314, 2015.
  • L. Wang, Y. Xiong, Z. Wang, and Y. Qiao, “Towards good practices for very deep two-stream convnets,” arXiv preprint arXiv:1507.02159, 2015.
  • H. Bilen, B. Fernando, E. Gavves, A. Vedaldi, and S. Gould, “Dynamic image networks for action recognition,” in CVPR, 2016.
  • L. Wang, Y. Xiong, Z. Wang, Y. Qiao, D. Lin, X. Tang, and L. Van Gool, “Temporal segment networks: towards good practices for deep action recognition,” in ECCV, 2016.
  • S. Ji, W. Xu, M. Yang, and K. Yu, “3D convolutional neural networks for human action recognition,” TPAMI, vol. 35, no. 1, pp. 221–231, 2013.
  • K. Soomro, A. R. Zamir, and M. Shah, “UCF101: A dataset of 101 human actions classes from videos in the wild,” CRCV-TR, vol. 12-01, 2012.
  • X. Peng, L. Wang, X. Wang, and Y. Qiao, “Bag of visual words and fusion methods for action recognition: Comprehensive study and good practice,” CVIU, vol. 150, pp. 109–125, 2016.
  • C. Feichtenhofer, A. Pinz, and R. P. Wildes, “Bags of spacetime energies for dynamic scene recognition,” in CVPR, pp. 2681–2688, 2014.
  • X. Ren and M. Philipose, “Egocentric recognition of handled objects: Benchmark and analysis,” in CVPRW, pp. 1–8, IEEE, 2009.
  • G. Varol, I. Laptev, and C. Schmid, “Long-term Temporal Convolutions for Action Recognition,” TPAMI, 2017.
  • L. Jing, X. Yang, and Y. Tian, “Video you only look once: Overall temporal convolutions for action recognition,” JVCIR, vol. 52, pp. 58–65, 2018.
  • J. Carreira and A. Zisserman, “Quo vadis, action recognition? a new model and the kinetics dataset,” in CVPR, pp. 4724–4733, IEEE, 2017.
  • K. Hara, H. Kataoka, and Y. Satoh, “Can spatiotemporal 3d cnns retrace the history of 2d cnns and imagenet,” in CVPR, pp. 18–22, 2018.
  • Z. Qiu, T. Yao, and T. Mei, “Learning spatio-temporal representation with pseudo-3d residual networks,” in ICCV, 2017.
  • Y. Bengio, P. Simard, and P. Frasconi, “Learning long-term dependencies with gradient descent is difficult,” IEEE transactions on neural networks, vol. 5, no. 2, pp. 157–166, 1994.
  • S. Hochreiter and J. Schmidhuber, “Long short-term memory,” Neural computation, vol. 9, no. 8, pp. 1735–1780, 1997.
  • J. Y. Ng, M. J. Hausknecht, S. Vijayanarasimhan, O. Vinyals, R. Monga, and G. Toderici, “Beyond Short Snippets: Deep Networks for Video Classification,” in CVPR, 2015.
  • Z. Li, K. Gavrilyuk, E. Gavves, M. Jain, and C. G. Snoek, “Videolstm convolves, attends and flows for action recognition,” CVIU, vol. 166, pp. 41–50, 2018.
  • S. Venugopalan, M. Rohrbach, J. Donahue, R. Mooney, T. Darrell, and K. Saenko, “Sequence to sequence – video to text,” in ICCV, 2015.
  • P. Molchanov, X. Yang, S. Gupta, K. Kim, S. Tyree, and J. Kautz, “Online detection and classification of dynamic hand gestures with recurrent 3d convolutional neural network,” in CVPR, pp. 4207–4215, 2016.
  • D. Bau, B. Zhou, A. Khosla, A. Oliva, and A. Torralba, “Network dissection: Quantifying interpretability of deep visual representations,” in CVPR, 2017.
  • D. Bau, J.-Y. Zhu, H. Strobelt, B. Zhou, J. B. Tenenbaum, W. T. Freeman, and A. Torralba, “Gan dissection: Visualizing and understanding generative adversarial networks,” arXiv preprint arXiv:1811.10597, 2018.
  • M. D. Zeiler and R. Fergus, “Visualizing and understanding convolutional networks,” in ECCV, pp. 818–833, Springer, 2014.
  • D. Pathak, R. Girshick, P. Dollar, T. Darrell, and B. Hariharan, “Learning features by watching objects move,” in CVPR, vol. 2, 2017.
  • G. Larsson, M. Maire, and G. Shakhnarovich, “Colorization as a proxy task for visual understanding,” in CVPR, 2017.
  • I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio, “Generative adversarial nets,” in NIPS, pp. 2672–2680, 2014.
  • J.-Y. Zhu, T. Park, P. Isola, and A. A. Efros, “Unpaired image-toimage translation using cycle-consistent adversarial networks,” in ICCV, 2017.
  • C. Vondrick, H. Pirsiavash, and A. Torralba, “Generating videos with scene dynamics,” in NIPS, pp. 613–621, 2016.
  • S. Tulyakov, M.-Y. Liu, X. Yang, and J. Kautz, “Mocogan: Decomposing motion and content for video generation,” CVPR, 2018.
  • U. Ahsan, R. Madhok, and I. Essa, “Video jigsaw: Unsupervised learning of spatiotemporal context for video action recognition,” arXiv preprint arXiv:1808.07507, 2018.
  • C. Wei, L. Xie, X. Ren, Y. Xia, C. Su, J. Liu, Q. Tian, and A. L. Yuille, “Iterative reorganization with weak spatial constraints: Solving arbitrary jigsaw puzzles for unsupervised representation learning,” arXiv preprint arXiv:1812.00329, 2018.
  • D. Kim, D. Cho, D. Yoo, and I. S. Kweon, “Learning image representations by completing damaged jigsaw puzzles,” arXiv preprint arXiv:1802.01880, 2018.
  • D. Wei, J. Lim, A. Zisserman, and W. T. Freeman, “Learning and using the arrow of time,” in CVPR, pp. 8052–8060, 2018.
  • I. Croitoru, S.-V. Bogolin, and M. Leordeanu, “Unsupervised learning from video to detect foreground objects in single images,” arXiv preprint arXiv:1703.10901, 2017.
  • H. Jiang, G. Larsson, M. Maire Greg Shakhnarovich, and E. Learned-Miller, “Self-supervised relative depth learning for urban scene understanding,” in ECCV, pp. 19–35, 2018.
  • R. Arandjelovic and A. Zisserman, “Look, listen and learn,” in ICCV, pp. 609–617, IEEE, 2017.
  • P. Agrawal, J. Carreira, and J. Malik, “Learning to see by moving,” in ICCV, pp. 37–45, 2015.
  • D. Jayaraman and K. Grauman, “Learning image representations tied to ego-motion,” in ICCV, pp. 1413–1421, 2015.
  • M. Everingham, L. Van Gool, C. K. Williams, J. Winn, and A. Zisserman, “The pascal visual object classes (voc) challenge,” IJCV, vol. 88, no. 2, pp. 303–338, 2010.
  • M. Cordts, M. Omran, S. Ramos, T. Rehfeld, M. Enzweiler, R. Benenson, U. Franke, S. Roth, and B. Schiele, “The cityscapes dataset for semantic urban scene understanding,” in CVPR, pp. 3213–3223, 2016.
  • B. Zhou, H. Zhao, X. Puig, S. Fidler, A. Barriuso, and A. Torralba, “Semantic understanding of scenes through the ade20k dataset,” arXiv preprint arXiv:1608.05442, 2016.
  • T.-Y. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollar, and C. L. Zitnick, “Microsoft coco: Common objects in context,” in ECCV, pp. 740–755, Springer, 2014.
  • J. Redmon, S. Divvala, R. Girshick, and A. Farhadi, “You only look once: Unified, real-time object detection,” in CVPR, pp. 779– 788, 2016.
  • J. Redmon and A. Farhadi, “Yolo9000: Better, faster, stronger,” in CVPR, 2017.
  • W. Liu, D. Anguelov, D. Erhan, C. Szegedy, S. Reed, C.-Y. Fu, and A. C. Berg, “Ssd: Single shot multibox detector,” in ECCV, pp. 21–37, Springer, 2016.
  • T.-Y. Lin, P. Dollar, R. B. Girshick, K. He, B. Hariharan, and S. J. Belongie, “Feature pyramid networks for object detection.,” in CVPR, vol. 1, p. 4, 2017.
  • T.-Y. Lin, P. Goyal, R. Girshick, K. He, and P. Dollar, “Focal loss for dense object detection,” TPAMI, 2018.
  • J. A. Suykens and J. Vandewalle, “Least squares support vector machine classifiers,” Neural processing letters, vol. 9, no. 3, pp. 293– 300, 1999.
  • H. Kuehne, H. Jhuang, R. Stiefelhagen, and T. Serre, “Hmdb51: A large video database for human motion recognition,” in HPCSE, pp. 571–582, Springer, 2013.
  • B. Zhou, A. Lapedriza, J. Xiao, A. Torralba, and A. Oliva, “Learning deep features for scene recognition using places database,” in NIPS, pp. 487–495, 2014.
  • B. Zhou, A. Khosla, A. Lapedriza, A. Torralba, and A. Oliva, “Places: An image database for deep scene understanding,” arXiv preprint arXiv:1610.02055, 2016.
  • S. Song, F. Yu, A. Zeng, A. X. Chang, M. Savva, and T. Funkhouser, “Semantic scene completion from a single depth image,” in CVPR, pp. 190–198, IEEE, 2017.
  • Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner, “Gradient-based learning applied to document recognition,” Proceedings of the IEEE, vol. 86, no. 11, pp. 2278–2324, 1998.
  • Y. Netzer, T. Wang, A. Coates, A. Bissacco, B. Wu, and A. Y. Ng, “Reading digits in natural images with unsupervised feature learning,” in NIPSW, vol. 2011, p. 5, 2011.
  • A. Krizhevsky and G. Hinton, “Learning multiple layers of features from tiny images,” tech. rep., Citeseer, 2009.
  • A. Coates, A. Ng, and H. Lee, “An analysis of single-layer networks in unsupervised feature learning,” in Proceedings of the fourteenth international conference on artificial intelligence and statistics, pp. 215–223, 2011.
  • B. Thomee, D. A. Shamma, G. Friedland, B. Elizalde, K. Ni, D. Poland, D. Borth, and L.-J. Li, “The new data and new challenges in multimedia research,” arXiv preprint arXiv:1503.01817, vol. 1, no. 8, 2015.
  • J. McCormac, A. Handa, S. Leutenegger, and A. J. Davison, “Scenenet rgb-d: Can 5m synthetic images beat generic imagenet pre-training on indoor segmentation,” in ICCV, vol. 4, 2017.
  • M. Monfort, B. Zhou, S. A. Bargal, T. Yan, A. Andonian, K. Ramakrishnan, L. Brown, Q. Fan, D. Gutfruend, C. Vondrick, et al., “Moments in time dataset: one million videos for event understanding,”
  • J. F. Gemmeke, D. P. Ellis, D. Freedman, A. Jansen, W. Lawrence, R. C. Moore, M. Plakal, and M. Ritter, “Audio set: An ontology and human-labeled dataset for audio events,” in ICASSP, pp. 776–780, IEEE, 2017.
  • A. Geiger, P. Lenz, and R. Urtasun, “Are we ready for autonomous driving? the kitti vision benchmark suite,” in CVPR, pp. 3354–3361, IEEE, 2012.
  • A. Joulin, L. van der Maaten, A. Jabri, and N. Vasilache, “Learning visual features from large weakly supervised data,” in ECCV, pp. 67–84, 2016.
  • A. Radford, L. Metz, and S. Chintala, “Unsupervised representation learning with deep convolutional generative adversarial networks,” arXiv preprint arXiv:1511.06434, 2015.
  • M. Arjovsky, S. Chintala, and L. Bottou, “Wasserstein gan,” arXiv preprint arXiv:1701.07875, 2017.
  • J. Donahue, P. Krahenbuhl, and T. Darrell, “Adversarial feature learning,” arXiv preprint arXiv:1605.09782, 2016.
  • T. Chen, X. Zhai, and N. Houlsby, “Self-supervised gan to counter forgetting,” arXiv preprint arXiv:1810.11598, 2018.
  • G. Larsson, M. Maire, and G. Shakhnarovich, “Learning representations for automatic colorization,” in ECCV, pp. 577–593, Springer, 2016.
  • S. Iizuka, E. Simo-Serra, and H. Ishikawa, “Globally and Locally Consistent Image Completion,” SIGGRAPH, 2017.
  • S. Jenni and P. Favaro, “Self-supervised feature learning by learning to spot artifacts,” arXiv preprint arXiv:1806.05024, 2018.
  • R. Santa Cruz, B. Fernando, A. Cherian, and S. Gould, “Visual permutation learning,” TPAMI, 2018.
  • J. Yang, D. Parikh, and D. Batra, “Joint unsupervised learning of deep representations and image clusters,” in CVPR, pp. 5147– 5156, 2016.
  • J. Xie, R. Girshick, and A. Farhadi, “Unsupervised deep embedding for clustering analysis,” in ICML, pp. 478–487, 2016.
  • M. Noroozi, H. Pirsiavash, and P. Favaro, “Representation learning by learning to count,” in ICCV, 2017.
  • G. E. Hinton and R. R. Salakhutdinov, “Reducing the dimensionality of data with neural networks,” science, vol. 313, no. 5786, pp. 504–507, 2006.
  • T. Salimans, I. Goodfellow, W. Zaremba, V. Cheung, A. Radford, and X. Chen, “Improved techniques for training gans,” in NIPS, pp. 2234–2242, 2016.
  • M. Heusel, H. Ramsauer, T. Unterthiner, B. Nessler, and S. Hochreiter, “Gans trained by a two time-scale update rule converge to a local nash equilibrium,” in NIPS, pp. 6626–6637, 2017.
  • A. Brock, J. Donahue, and K. Simonyan, “Large scale gan training for high fidelity natural image synthesis,” arXiv preprint arXiv:1809.11096, 2018.
  • T. Karras, S. Laine, and T. Aila, “A style-based generator architecture for generative adversarial networks,” arXiv preprint arXiv:1812.04948, 2018.
  • P. Isola, J.-Y. Zhu, T. Zhou, and A. A. Efros, “Image-to-image translation with conditional adversarial networks,” arxiv, 2016.
  • R. Zhang, J.-Y. Zhu, P. Isola, X. Geng, A. S. Lin, T. Yu, and A. A. Efros, “Real-time user-guided image colorization with learned deep priors,” arXiv preprint arXiv:1705.02999, 2017.
  • S. Iizuka, E. Simo-Serra, and H. Ishikawa, “Let there be color!: joint end-to-end learning of global and local image priors for automatic image colorization with simultaneous classification,” TOG, vol. 35, no. 4, p. 110, 2016.
  • A. K. Jain, M. N. Murty, and P. J. Flynn, “Data clustering: a review,” ACM computing surveys (CSUR), vol. 31, no. 3, pp. 264– 323, 1999.
  • N. Dalal and B. Triggs, “Histograms of oriented gradients for human detection,” in CVPR, vol. 1, pp. 886–893, IEEE, 2005.
  • J. Sanchez, F. Perronnin, T. Mensink, and J. Verbeek, “Image classification with the fisher vector: Theory and practice,” IJCV, vol. 105, no. 3, pp. 222–245, 2013.
  • S. Shah, D. Dey, C. Lovett, and A. Kapoor, “Airsim: High-fidelity visual and physical simulation for autonomous vehicles,” in Field and Service Robotics, 2017.
  • A. Dosovitskiy, G. Ros, F. Codevilla, A. Lopez, and V. Koltun, “CARLA: An open urban driving simulator,” in Proceedings of the 1st Annual Conference on Robot Learning, pp. 1–16, 2017.
  • M. Saito, E. Matsumoto, and S. Saito, “Temporal generative adversarial nets with singular value clipping,” in ICCV, vol. 2, p. 5, 2017.
  • C. Vondrick, A. Shrivastava, A. Fathi, S. Guadarrama, and K. Murphy, “Tracking emerges by colorizing videos,” in ECCV, 2018.
  • S. Xingjian, Z. Chen, H. Wang, D.-Y. Yeung, W.-K. Wong, and W.-c. Woo, “Convolutional lstm network: A machine learning approach for precipitation nowcasting,” in NIPS, pp. 802–810, 2015.
  • R. Villegas, J. Yang, S. Hong, X. Lin, and H. Lee, “Decomposing motion and content for natural video sequence prediction,” in ICLR, 2017.
  • Z. Luo, B. Peng, D.-A. Huang, A. Alahi, and L. Fei-Fei, “Unsupervised learning of long-term motion dynamics for videos,” in CVPR, 2017.
  • B. Brattoli, U. Buchler, A.-S. Wahl, M. E. Schwab, and B. Ommer, “Lstm self-supervision for detailed behavior analysis,” in CVPR, pp. 3747–3756, IEEE, 2017.
  • D. Jayaraman and K. Grauman, “Slow and steady feature analysis: higher order temporal coherence in video,” in CVPR, pp. 3852–3861, 2016.
  • A. Dosovitskiy, P. Fischer, E. Ilg, P. Hausser, C. Hazirbas, V. Golkov, P. Van Der Smagt, D. Cremers, and T. Brox, “Flownet: Learning optical flow with convolutional networks,” in ICCV, pp. 2758–2766, 2015.
  • E. Ilg, N. Mayer, T. Saikia, M. Keuper, A. Dosovitskiy, and T. Brox, “Flownet 2.0: Evolution of optical flow estimation with deep networks,” in CVPR, pp. 1647–1655, IEEE, 2017.
  • S. Meister, J. Hur, and S. Roth, “Unflow: Unsupervised learning of optical flow with a bidirectional census loss,” in AAAI, 2018.
  • A. Owens, J. Wu, J. H. McDermott, W. T. Freeman, and A. Torralba, “Ambient sound provides supervision for visual learning,” in ECCV, pp. 801–816, Springer, 2016.
  • T. Zhou, M. Brown, N. Snavely, and D. G. Lowe, “Unsupervised learning of depth and ego-motion from video,” in CVPR, vol. 2, p. 7, 2017.
  • Z. Yin and J. Shi, “Geonet: Unsupervised learning of dense depth, optical flow and camera pose,” in CVPR, vol. 2, 2018.
  • Y. Zou, Z. Luo, and J.-B. Huang, “Df-net: Unsupervised joint learning of depth and flow using cross-task consistency,” in ECCV, pp. 38–55, Springer, 2018.
  • G. Iyer, J. K. Murthy, G. Gupta, K. M. Krishna, and L. Paull, “Geometric consistency for self-supervised end-to-end visual odometry,” arXiv preprint arXiv:1804.03789, 2018.
  • Y. Zhang, S. Khamis, C. Rhemann, J. Valentin, A. Kowdle, V. Tankovich, M. Schoenberg, S. Izadi, T. Funkhouser, and S. Fanello, “Activestereonet: End-to-end self-supervised learning for active stereo systems,” in ECCV, pp. 784–801, 2018.
  • D. Tran, L. Bourdev, R. Fergus, L. Torresani, and M. Paluri, “Deep end2end voxel2voxel prediction,” in CVPRW, pp. 17–24, 2016.
  • M. Mathieu, C. Couprie, and Y. LeCun, “Deep multi-scale video prediction beyond mean square error,” in ICLR, 2016.
  • F. A. Reda, G. Liu, K. J. Shih, R. Kirby, J. Barker, D. Tarjan, A. Tao, and B. Catanzaro, “Sdc-net: Video prediction using spatiallydisplaced convolution,” in ECCV, pp. 718–733, 2018.
  • M. Babaeizadeh, C. Finn, D. Erhan, R. H. Campbell, and S. Levine, “Stochastic variational video prediction,” arXiv preprint arXiv:1710.11252, 2017.
  • X. Liang, L. Lee, W. Dai, and E. P. Xing, “Dual motion gan for future-flow embedded video prediction,” in ICCV, vol. 1, 2017.
  • C. Finn, I. Goodfellow, and S. Levine, “Unsupervised learning for physical interaction through video prediction,” in NIPS, pp. 64– 72, 2016.
  • P. Krähenbühl, “Free supervision from video games,” in CVPR, June 2018.
  • S. Abu-El-Haija, N. Kothari, J. Lee, P. Natsev, G. Toderici, B. Varadarajan, and S. Vijayanarasimhan, “Youtube-8m: A large-scale video classification benchmark,” arXiv preprint arXiv:1609.08675, 2016.
  • L. Gomez, Y. Patel, M. Rusinol, D. Karatzas, and C. Jawahar, “Self-supervised learning of visual features through embedding images into text topic spaces,” in CVPR, IEEE, 2017.