Self-Supervised Learning by Cross-Modal Audio-Video Clustering

NeurIPS 2020.

Keywords: Concatenation Deep Clustering, Multi-Head Deep Clustering, large-scale video

Abstract:

Visual and audio modalities are highly correlated, yet they contain different information. Their strong correlation makes it possible to predict the semantics of one from the other with good accuracy. Their intrinsic differences make cross-modal prediction a potentially more rewarding pretext task for self-supervised learning of video and…

Introduction
  • Do we need to explicitly name actions such as “laughing” or “sneezing” in order to recognize them?
  • Or can we learn to visually classify them without labels, by associating characteristic sounds with these actions?
  • A wide literature in perceptual studies provides evidence that we rely heavily on hearing sounds to make sense of actions and dynamic events in the visual world.
  • The motivation for the study stems from two fundamental challenges facing a fully-supervised line of attack to learning video models.
  • The first challenge is the exorbitant cost of scaling up…
Highlights
  • Do we need to explicitly name the actions of “laughing” or “sneezing” in order to recognize them? Or can we learn to visually classify them without labels by associating characteristic sounds with these actions? A wide literature in perceptual studies provides evidence that we rely heavily on hearing sounds to make sense of actions and dynamic events in the visual world.
  • We investigate the hypothesis that spatiotemporal models for action recognition can be reliably pretrained from unlabeled videos by capturing cross-modal information from audio and video
  • The motivation for our study stems from two fundamental challenges facing a fully-supervised line of attack to learning video models
  • We present three approaches for training video models from self-supervised audio-visual information
  • Unlike previous self-supervised methods that are only pretrained on curated data (e.g., Kinetics [25] without action labels), we report results of XDC pretrained on a large-scale uncurated video dataset
  • We presented Cross-Modal Deep Clustering (XDC), a novel self-supervised model for video and audio (a minimal sketch of the cross-modal clustering loop follows this list).
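
A minimal sketch of the cross-modal deep-clustering idea summarized above, under stated assumptions: features of one modality are grouped with k-means, and the resulting cluster assignments act as pseudo-labels that supervise the encoder of the other modality, alternating between the two. The toy MLP encoders, random tensors, and hyperparameters below are illustrative stand-ins rather than the authors' implementation, which trains deep video and audio backbones (e.g., R(2+1)D-18) on real clips.

```python
import torch
import torch.nn as nn
from sklearn.cluster import KMeans

K = 8                        # number of k-means clusters (Table 2 ablates this choice)
N, D_V, D_A = 256, 128, 64   # toy dataset size and per-modality input dimensions

video_clips = torch.randn(N, D_V)   # stand-in for video clips
audio_clips = torch.randn(N, D_A)   # stand-in for audio spectrograms

video_enc = nn.Sequential(nn.Linear(D_V, 64), nn.ReLU(), nn.Linear(64, 32))
audio_enc = nn.Sequential(nn.Linear(D_A, 64), nn.ReLU(), nn.Linear(64, 32))
video_head = nn.Linear(32, K)   # classifies audio-derived pseudo-labels
audio_head = nn.Linear(32, K)   # classifies video-derived pseudo-labels

def pseudo_labels(encoder, data, k=K):
    """Cluster one modality's features; the cluster ids act as pseudo-labels."""
    with torch.no_grad():
        feats = encoder(data).numpy()
    return torch.as_tensor(KMeans(n_clusters=k, n_init=10).fit_predict(feats)).long()

def train_on_pseudo_labels(encoder, head, data, labels, steps=20, lr=0.01):
    """Standard supervised training, but on the *other* modality's cluster ids."""
    opt = torch.optim.SGD(list(encoder.parameters()) + list(head.parameters()), lr=lr)
    loss_fn = nn.CrossEntropyLoss()
    for _ in range(steps):
        opt.zero_grad()
        loss = loss_fn(head(encoder(data)), labels)
        loss.backward()
        opt.step()
    return loss.item()

# One cross-modal round: each network is supervised by clusters of the other modality.
audio_clusters = pseudo_labels(audio_enc, audio_clips)   # supervise the video network
video_clusters = pseudo_labels(video_enc, video_clips)   # supervise the audio network
print("video loss:", train_on_pseudo_labels(video_enc, video_head, video_clips, audio_clusters))
print("audio loss:", train_on_pseudo_labels(audio_enc, audio_head, audio_clips, video_clusters))
# In the full method, this cluster-then-train alternation is repeated for several rounds.
```

The number of clusters k and the number of clustering/training rounds are the main knobs here; Table 2 below reports how k affects downstream accuracy.
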
Methods
  • The relative performance of XDC compared to supervised pretrained models stays generally the same whether the model is fully finetuned or only its final fc layer is finetuned on the downstream task.
  • This suggests that XDC pretraining is useful both as a fixed feature extractor and as a pretraining initialization (both finetuning regimes are sketched after this list).
  • This is expected, as the label spaces of HMDB51 and UCF101 largely overlap with that of Kinetics.
  • This suggests that fully-supervised pretraining is more taxonomy/downstream-task dependent, while the self-supervised XDC is taxonomy-independent
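
A minimal sketch of the two finetuning regimes compared above, assuming a generic torchvision R(2+1)D-18 as a stand-in for the pretrained backbone; the helper `build_finetune_model` and the omitted weight-loading step are illustrative, not the paper's code. In the fc-only regime the pretrained backbone is frozen and acts as a fixed feature extractor, while in full finetuning every layer is updated on the downstream task.

```python
import torch.nn as nn
from torchvision.models.video import r2plus1d_18

def build_finetune_model(num_classes, fc_only):
    # A generic torchvision R(2+1)D-18 stands in for the pretrained XDC backbone;
    # in practice the self-supervised (or supervised) weights would be loaded here.
    model = r2plus1d_18()
    if fc_only:
        for p in model.parameters():
            p.requires_grad = False          # backbone frozen: fixed feature extractor
    model.fc = nn.Linear(model.fc.in_features, num_classes)  # new, trainable classifier
    return model

ucf_full = build_finetune_model(num_classes=101, fc_only=False)  # full finetuning
ucf_fc   = build_finetune_model(num_classes=101, fc_only=True)   # fc-only finetuning
```

Either model is then trained with a standard cross-entropy objective on the downstream dataset; only the set of trainable parameters differs between the two regimes.
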
Results
  • Self-supervised baselines compared against XDC, with backbone architecture and pretraining dataset:
  • ClipOrder [69]: R(2+1)D-18, pretrained on UCF101.
  • MotionPred [64]: C3D, pretrained on Kinetics.
  • RotNet3D [23]: 3D-ResNet18, pretrained on Kinetics.
  • ST-Puzzle [26]: 3D-ResNet18, pretrained on Kinetics.
  • DPC [16]: 3D-ResNet34, pretrained on Kinetics.
  • AVTS [27]∗: MC3-18, pretrained on Kinetics.
  • AVTS [27]†: R(2+1)D-18 (evaluated for direct comparison with XDC; see Table 7).
Conclusion
  • The authors presented Cross-Modal Deep Clustering (XDC), a novel self-supervised model for video and audio.
  • XDC outperforms existing self-supervised methods and fully-supervised ImageNet- and Kinetics-pretraining for action recognition.
  • To the best of the authors' knowledge, XDC is the first to show self-supervision outperforming large-scale full-supervision pretraining for action recognition.
  • As the proposed approach is self-supervised, it will learn the inherent properties and structure of the training data.
  • The learned model may exhibit biases intrinsically present in the data
Tables
  • Table1: Single-modality vs. multi-modal deep clustering. We compare the four self-supervised deep clustering models (Section 3) and the two baselines, Scratch and Supervised Pretraining (Superv). Models are pretrained via self-supervision on Kinetics and fully finetuned on each downstream dataset. We report the top-1 accuracy on split-1 of each dataset. All multi-modal models significantly outperform the single-modality deep clustering model. We mark in bold the best and underline the second-best models
  • Table2: The number of clusters (k). We show the effect of the number of k-means clusters on XDC
  • Table3: Pretraining data type and size. We compare XDC pretrained on five datasets vs. fully-supervised pretrained baselines (Superv). XDC significantly outperforms fully-supervised pretraining on HMDB51
  • Table4: Curated vs. uncurated pretraining data. XDC pretrained on IG-Kinetics (curated) vs. IG-Random (uncurated).
  • Table5: Full finetuning vs. learning fc-only. We compare XDC against the supervised pretrained models
  • Table6: XDC clusters. Top and bottom audio (left) and video (right) XDC clusters ranked by clustering purity w.r.t. Kinetics labels. For each cluster, we list the three concepts with the highest purity (given in parentheses)
  • Table7: State-of-the-art comparison. We report the average top-1 accuracy over the official splits for all benchmarks. (a) Video action recognition: comparison of XDC with self-supervised and fully-supervised methods on UCF101 and HMDB51. Not only does XDC set new state-of-the-art performance for self-supervised methods, it also outperforms fully-supervised Kinetics and ImageNet pretraining. ∗ For fair comparison with XDC, we report AVTS performance without dense prediction, i.e., we average the predictions of 10 uniformly-sampled clips at inference (this protocol is sketched after this list of tables). † For direct comparison with XDC, we evaluate AVTS using R(2+1)D-18 and 10 uniformly-sampled clips at inference. Part (a) compares XDC pretrained on four large-scale datasets against state-of-the-art self-supervised methods after finetuning on the UCF101 and HMDB51 benchmarks, as well as against two fully-supervised methods pretrained on ImageNet and Kinetics. Results: (I) XDC pretrained on IG-Kinetics sets new state-of-the-art performance for self-supervised methods on both benchmarks, outperforming AVTS [27] by 6.4% on UCF101 and 10.8% on HMDB51. Moreover, XDC significantly outperforms fully-supervised pretraining on Kinetics: by 1.3% on UCF101 and by 3.8% on HMDB51. (II) When directly compared with the same R(2+1)D-18 architecture, XDC pretrained on Kinetics slightly outperforms AVTS, by 0.6% on UCF101 and 0.3% on HMDB51; when both methods are pretrained on AudioSet, XDC outperforms AVTS by a larger margin. (b) Audio event classification: part (b) compares XDC pretrained on AudioSet and IG-Random against the state-of-the-art self-supervised methods for audio classification. XDC achieves state-of-the-art performance on DCASE and competitive results on ESC50, with only a 1.1% gap with [50].
  • Table8: Training parameter definitions. The abbreviations and descriptions of the training parameters.
  • Table9: Pretraining parameters. We use early-stopping for Kinetics and AudioSet since we observe some overfitting on the pretext tasks. For the last iteration of XDC on IG-Kinetics and IG-Random, we pretrain XDC…
  • Table10: Finetuning parameters. Different pretraining methods have different ranges of optimal base learning rate when finetuning on downstream tasks. Thus, we cross-validate all methods with the same set of base learning rates and report the best result for each method. γ is set to 0.01 for all settings
  • Table11: Finetuning base learning rates. For a fair comparison, we cross-validate all pretraining methods with the same set of base learning rates. We report the best finetuning result for each method. Learning FC-only benefits from cross-validation with a wider range of base learning rates
  • Table12: XDC audio clusters. Top and bottom 10 XDC audio clusters ranked by clustering purity w.r.t. Kinetics labels.
  • Table13: XDC video clusters. Top and bottom 10 XDC video clusters ranked by clustering purity w.r.t. Kinetics labels. For each, we list the 5 concepts with the highest purity (given in parentheses)
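
The Table 7 footnotes describe the clip-averaging inference protocol used for the video benchmarks: 10 clips are sampled uniformly over each test video and their predictions are averaged into a single video-level prediction. The sketch below illustrates that protocol; the function name, the 8-frame clip length, and the (T, C, H, W) frame layout are assumptions for illustration, not details taken from the paper.

```python
import torch

def video_level_prediction(model, frames, num_clips=10, clip_len=8):
    """frames: (T, C, H, W) tensor holding all decoded frames of one video."""
    T = frames.shape[0]
    starts = torch.linspace(0, T - clip_len, num_clips).long().tolist()  # uniform clip starts
    probs = []
    model.eval()
    with torch.no_grad():
        for s in starts:
            clip = frames[s:s + clip_len]                  # (clip_len, C, H, W)
            clip = clip.permute(1, 0, 2, 3).unsqueeze(0)   # (1, C, clip_len, H, W)
            probs.append(torch.softmax(model(clip), dim=1))
    return torch.stack(probs).mean(dim=0)                  # video-level class probabilities
```

The ∗ entries in Table 7 drop AVTS's dense prediction in favor of this same 10-clip protocol so that the comparison with XDC is made under one inference setting.
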
Related work
  • Early unsupervised representation learning. Pioneering works include deep belief networks [19], autoencoders [20, 59], shift-invariant decoders [47], sparse coding algorithms [31], and stacked ISAs [30]. While these approaches learn by reconstructing the input, our approach learns from a self-supervised pretext task by generating pseudo-labels for supervised learning from unlabeled data.

    Self-supervised representation learning from images and videos. Several pretext tasks exploit image spatial context, e.g., by predicting the relative position of patches [6] or solving jigsaw puzzles [37]. Others include creating image classification pseudo-labels (e.g., through artificial rotations [11] or clustering features [4]), colorization [70], inpainting [43], motion segmentation [42], and instance counting [38]. Some works have extended image pretext tasks to video [26, 64, 69]. Other video pretext tasks include frame ordering [7, 32, 35, 68], predicting flow or colors [29, 62], exploiting region correspondences across frames [21, 22, 65, 66], future frame prediction [33, 34, 54, 60, 61], and tracking [67]. Unlike this prior work, our model uses two modalities: video and audio.
References
  • Relja Arandjelovic and Andrew Zisserman. Look, listen and learn. In ICCV, 2017.
  • Yusuf Aytar, Carl Vondrick, and Antonio Torralba. Soundnet: Learning sound representations from unlabeled video. In NeurIPS, 2016.
  • Fabian Caba Heilbron, Victor Escorcia, Bernard Ghanem, and Juan Carlos Niebles. Activitynet: A large-scale video benchmark for human activity understanding. In CVPR, 2015.
  • Mathilde Caron, Piotr Bojanowski, Armand Joulin, and Matthijs Douze. Deep clustering for unsupervised learning of visual features. In ECCV, 2018.
  • Dima Damen, Hazel Doughty, Giovanni Maria Farinella, Sanja Fidler, Antonino Furnari, Evangelos Kazakos, Davide Moltisanti, Jonathan Munro, Toby Perrett, Will Price, and Michael Wray. Scaling egocentric vision: The epic-kitchens dataset. In ECCV, 2018.
  • Carl Doersch, Abhinav Gupta, and Alexei A Efros. Unsupervised visual representation learning by context prediction. In ICCV, 2015.
  • Basura Fernando, Hakan Bilen, Efstratios Gavves, and Stephen Gould. Self-supervised video representation learning with odd-one-out networks. In CVPR, 2017.
  • Jort F. Gemmeke, Daniel P. W. Ellis, Dylan Freedman, Aren Jansen, Wade Lawrence, R. Channing Moore, Manoj Plakal, and Marvin Ritter. Audio set: An ontology and human-labeled dataset for audio events. In ICASSP, 2017.
  • A. Gentile and S. DiFrancesca. Academic achievement test performance of hearing-impaired students: United States, Spring 1969 (Series D, No. 1). Washington, DC: Gallaudet University, Center for Assessment and Demographic Studies, 1969.
  • Deepti Ghadiyaram, Matt Feiszli, Du Tran, Xueting Yan, Heng Wang, and Dhruv Mahajan. Large-scale weakly-supervised pre-training for video action recognition. In CVPR, 2019.
  • Spyros Gidaris, Praveer Singh, and Nikos Komodakis. Unsupervised representation learning by predicting image rotations. In ICLR, 2018.
  • Priya Goyal, Piotr Dollár, Ross Girshick, Pieter Noordhuis, Lukasz Wesolowski, Aapo Kyrola, Andrew Tulloch, Yangqing Jia, and Kaiming He. Accurate, large minibatch SGD: training imagenet in 1 hour. arXiv preprint arXiv:1706.02677, 2017.
  • Priya Goyal, Dhruv Mahajan, Abhinav Gupta, and Ishan Misra. Scaling and benchmarking self-supervised visual representation learning. In ICCV, 2019.
  • Raghav Goyal, Samira Ebrahimi Kahou, Vincent Michalski, Joanna Materzynska, Susanne Westphal, Heuna Kim, Valentin Haenel, Ingo Fruend, Peter Yianilos, Moritz Mueller-Freitag, et al. The "something something" video database for learning and evaluating visual common sense. In ICCV, 2017.
  • Saurabh Gupta, Judy Hoffman, and Jitendra Malik. Cross modal distillation for supervision transfer. In CVPR, 2016.
  • Tengda Han, Weidi Xie, and Andrew Zisserman. Video representation learning by dense predictive coding. arXiv preprint arXiv:1909.04656, 2019.
  • Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In CVPR, 2016.
  • Rickye S. Heffner and Henry E. Heffner. Evolution of Sound Localization in Mammals, pages 691–715. Springer New York, New York, NY, 1992.
  • Geoffrey E Hinton, Simon Osindero, and Yee-Whye Teh. A fast learning algorithm for deep belief nets. Neural Computation, 2006.
  • Geoffrey E Hinton and Ruslan R Salakhutdinov. Reducing the dimensionality of data with neural networks. Science, 2006.
  • Phillip Isola, Daniel Zoran, Dilip Krishnan, and Edward H Adelson. Learning visual groups from co-occurrences in space and time. ICLR, 2015.
  • Dinesh Jayaraman and Kristen Grauman. Slow and steady feature analysis: higher order temporal coherence in video. In CVPR, 2016.
  • Longlong Jing and Yingli Tian. Self-supervised spatiotemporal feature learning by video geometric transformations. arXiv preprint arXiv:1811.11387, 2018.
  • Andrej Karpathy, George Toderici, Sanketh Shetty, Thomas Leung, Rahul Sukthankar, and Li Fei-Fei. Large-scale video classification with convolutional neural networks. In CVPR, 2014.
  • Will Kay, Joao Carreira, Karen Simonyan, Brian Zhang, Chloe Hillier, Sudheendra Vijayanarasimhan, Fabio Viola, Tim Green, Trevor Back, Paul Natsev, Mustafa Suleyman, and Andrew Zisserman. The kinetics human action video dataset. CoRR, abs/1705.06950, 2017.
  • Dahun Kim, Donghyeon Cho, and In So Kweon. Self-supervised video representation learning with space-time cubic puzzles. In AAAI, 2019.
  • Bruno Korbar, Du Tran, and Lorenzo Torresani. Cooperative learning of audio and video models from self-supervised synchronization. In NeurIPS, 2018.
  • H. Kuehne, H. Jhuang, E. Garrote, T. Poggio, and T. Serre. HMDB: a large video database for human motion recognition. In ICCV, 2011.
  • Zihang Lai and Weidi Xie. Self-supervised learning for video correspondence flow. BMVC, 2019.
  • Quoc V Le, Will Y Zou, Serena Y Yeung, and Andrew Y Ng. Learning hierarchical invariant spatio-temporal features for action recognition with independent subspace analysis. In CVPR, 2011.
  • Honglak Lee, Alexis Battle, Rajat Raina, and Andrew Y Ng. Efficient sparse coding algorithms. In NeurIPS, 2007.
  • Hsin-Ying Lee, Jia-Bin Huang, Maneesh Singh, and Ming-Hsuan Yang. Unsupervised representation learning by sorting sequences. In ICCV, 2017.
  • William Lotter, Gabriel Kreiman, and David Cox. Deep predictive coding networks for video prediction and unsupervised learning. ICLR, 2017.
  • Michael Mathieu, Camille Couprie, and Yann LeCun. Deep multi-scale video prediction beyond mean square error. ICLR, 2016.
  • Ishan Misra, C Lawrence Zitnick, and Martial Hebert. Shuffle and learn: unsupervised learning using temporal order verification. In ECCV, 2016.
  • Helmer R Myklebust. The psychology of deafness: Sensory deprivation, learning, and adjustment. Grune & Stratton, 1960.
  • Mehdi Noroozi and Paolo Favaro. Unsupervised learning of visual representations by solving jigsaw puzzles. In ECCV, 2016.
  • Mehdi Noroozi, Hamed Pirsiavash, and Paolo Favaro. Representation learning by learning to count. In ICCV, 2017.
  • Risto Näätänen. Attention and Brain Function. Lawrence Erlbaum Associates, Inc, 1992.
  • Andrew Owens and Alexei A Efros. Audio-visual scene analysis with self-supervised multisensory features. In ECCV, 2018.
  • Andrew Owens, Jiajun Wu, Josh H McDermott, William T Freeman, and Antonio Torralba. Ambient sound provides supervision for visual learning. In ECCV, 2016.
  • Deepak Pathak, Ross Girshick, Piotr Dollár, Trevor Darrell, and Bharath Hariharan. Learning features by watching objects move. In CVPR, 2017.
  • Deepak Pathak, Philipp Krahenbuhl, Jeff Donahue, Trevor Darrell, and Alexei A Efros. Context encoders: Feature learning by inpainting. In CVPR, 2016.
  • Karol J. Piczak. Environmental sound classification with convolutional neural networks. MLSP, 2015.
  • Karol J. Piczak. ESC: Dataset for environmental sound classification. In ACM Multimedia, 2015.
  • Jordi Pons and Xavier Serra. Randomly weighted CNNs for (music) audio classification. In
  • Marc'aurelio Ranzato, Fu Jie Huang, Y-Lan Boureau, and Yann LeCun. Unsupervised learning of invariant feature hierarchies with applications to object recognition. In CVPR, 2007.
  • Guido Roma, Waldo Nogueira, and Perfecto Herrera. Recurrence quantification analysis features for environmental sound recognition. WASPAA, 2013.
  • Andrew Rouditchenko, Hang Zhao, Chuang Gan, Josh McDermott, and Antonio Torralba. Self-supervised audio-visual co-segmentation. In ICASSP, 2019.
  • Hardik B. Sailor, Dharmesh M Agrawal, and Hemant A Patil. Unsupervised filterbank learning using convolutional restricted boltzmann machine for environmental sound classification. In INTERSPEECH, 2017.
  • Andrew M Saxe, Pang Wei Koh, Zhenghao Chen, Maneesh Bhand, Bipin Suresh, and Andrew Y Ng. On random weights and unsupervised feature learning. In ICML, 2011.
  • Ladan Shams and Robyn Kim. Crossmodal influences on visual perception. Physics of Life Reviews, 7(3):269–284, 2010.
  • Khurram Soomro, Amir Roshan Zamir, and Mubarak Shah. UCF101: A dataset of 101 human action classes from videos in the wild. Technical Report CRCV-TR-12-01, 2012.
  • Nitish Srivastava, Elman Mansimov, and Ruslan Salakhudinov. Unsupervised learning of video representations using lstms. In ICML, 2015.
  • Dan Stowell, Dimitrios Giannoulis, Emmanouil Benetos, Mathieu Lagrange, and Mark D. Plumbley. Detection and classification of acoustic scenes and events. TM, 2015.
  • D. Stowell, D. Giannoulis, E. Benetos, M. Lagrange, and M. D. Plumbley. Detection and classification of acoustic scenes and events. IEEE Transactions on Multimedia, 17(10):1733–1746, Oct 2015.
  • Yonglong Tian, Dilip Krishnan, and Phillip Isola. Contrastive multiview coding. arXiv preprint arXiv:1906.05849, 2019.
  • Du Tran, Heng Wang, Lorenzo Torresani, Jamie Ray, Yann LeCun, and Manohar Paluri. A closer look at spatiotemporal convolutions for action recognition. In CVPR, 2018.
  • Pascal Vincent, Hugo Larochelle, Yoshua Bengio, and Pierre-Antoine Manzagol. Extracting and composing robust features with denoising autoencoders. In ICML, 2008.
  • Carl Vondrick, Hamed Pirsiavash, and Antonio Torralba. Anticipating visual representations from unlabeled video. In CVPR, 2016.
  • Carl Vondrick, Hamed Pirsiavash, and Antonio Torralba. Generating videos with scene dynamics. In NeurIPS, 2016.
  • Carl Vondrick, Abhinav Shrivastava, Alireza Fathi, Sergio Guadarrama, and Kevin Murphy. Tracking emerges by colorizing videos. In ECCV, 2018.
  • Heng Wang and Cordelia Schmid. Action recognition with improved trajectories. In ICCV, 2013.
  • Jiangliu Wang, Jianbo Jiao, Linchao Bao, Shengfeng He, Yunhui Liu, and Wei Liu. Self-supervised spatio-temporal representation learning for videos by predicting motion and appearance statistics. In CVPR, 2019.
  • Xiaolong Wang and Abhinav Gupta. Unsupervised learning of visual representations using videos. In ICCV, 2015.
  • Xiaolong Wang, Allan Jabri, and Alexei A Efros. Learning correspondence from the cycle-consistency of time. In CVPR, 2019.
  • Xiaolong Wang, Allan Jabri, and Alexei A. Efros. Learning correspondence from the cycle-consistency of time. In CVPR, 2019.
  • Donglai Wei, Joseph J Lim, Andrew Zisserman, and William T Freeman. Learning and using the arrow of time. In CVPR, 2018.
  • Dejing Xu, Jun Xiao, Zhou Zhao, Jian Shao, Di Xie, and Yueting Zhuang. Self-supervised spatiotemporal learning via video clip order prediction. In CVPR, 2019.
  • Richard Zhang, Phillip Isola, and Alexei A Efros. Colorful image colorization. In ECCV, 2016.
  • Hang Zhao, Chuang Gan, Andrew Rouditchenko, Carl Vondrick, Josh McDermott, and Antonio Torralba. The sound of pixels. In ECCV, 2018.