Self-Supervised Learning by Cross-Modal Audio-Video Clustering

Neural Information Processing Systems, 2019.

Keywords: andrew y ng, large scale, Concatenation Deep Clustering, video dataset, Multi-Head Deep Clustering
Weibo:
We propose Cross-Modal Deep Clustering (XDC), a novel self-supervised method that leverages unsupervised clustering in one modality as a supervisory signal for the other modality

Abstract:

The visual and audio modalities are highly correlated, yet they contain different information. Their strong correlation makes it possible to predict the semantics of one from the other with good accuracy. Their intrinsic differences make cross-modal prediction a potentially more rewarding pretext task for self-supervised learning of video…
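
The one-sentence summary above captures the core mechanism: k-means cluster assignments computed on one modality's features act as classification pseudo-labels for the other modality's encoder. Below is a minimal, hedged sketch of that idea using PyTorch and scikit-learn k-means. The toy MLP encoders, random stand-in data, dimensions, and alternation schedule are illustrative assumptions, not the authors' implementation (which trains full video and audio backbones such as R(2+1)D-18).

```python
# Minimal sketch of cross-modal deep clustering (XDC-style pseudo-labeling).
# Toy MLP encoders and random data stand in for the real video/audio networks;
# names and hyperparameters here are illustrative assumptions, not the authors' code.
import numpy as np
import torch
import torch.nn as nn
from sklearn.cluster import KMeans

torch.manual_seed(0)
np.random.seed(0)

N, D_VID, D_AUD, K = 512, 128, 64, 16  # samples, feature dims, number of clusters

class Encoder(nn.Module):
    """Stand-in for a modality encoder plus a k-way classification head."""
    def __init__(self, in_dim, k):
        super().__init__()
        self.backbone = nn.Sequential(nn.Linear(in_dim, 256), nn.ReLU())
        self.head = nn.Linear(256, k)

    def features(self, x):
        return self.backbone(x)

    def forward(self, x):
        return self.head(self.backbone(x))

def pseudo_labels(features, k):
    """k-means assignments on one modality become class targets for the other."""
    return KMeans(n_clusters=k, n_init=10).fit_predict(features)

def train_on_pseudo_labels(model, inputs, labels, epochs=3):
    """Standard supervised training, except the 'labels' come from the other modality."""
    opt = torch.optim.SGD(model.parameters(), lr=0.1)
    loss_fn = nn.CrossEntropyLoss()
    targets = torch.as_tensor(labels, dtype=torch.long)
    for _ in range(epochs):
        opt.zero_grad()
        loss = loss_fn(model(inputs), targets)
        loss.backward()
        opt.step()

# Fake "clips": in practice these would be video tensors and audio spectrograms.
video_x = torch.randn(N, D_VID)
audio_x = torch.randn(N, D_AUD)
video_enc, audio_enc = Encoder(D_VID, K), Encoder(D_AUD, K)

for _ in range(2):  # alternate a few self-labeling rounds (schedule is an assumption)
    with torch.no_grad():
        vid_feats = video_enc.features(video_x).numpy()
        aud_feats = audio_enc.features(audio_x).numpy()
    # Cross-modal step: audio clusters supervise the video encoder, and vice versa.
    train_on_pseudo_labels(video_enc, video_x, pseudo_labels(aud_feats, K))
    train_on_pseudo_labels(audio_enc, audio_x, pseudo_labels(vid_feats, K))
```

The single-modality and other multi-modal deep clustering variants compared in Table 1 differ mainly in which features are clustered and which encoder consumes the resulting pseudo-labels.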

Introduction
  • Do we need to explicitly name the actions of “laughing” or “sneezing” in order to recognize them? Or can we learn to visually classify them without labels by associating characteristic sounds with these actions? A wide body of perceptual studies provides evidence that humans rely heavily on hearing sounds to make sense of actions and dynamic events in the visual world.
  • Recently introduced video datasets differ substantially in their label spaces, which range from object-independent actions [16] to sports [26] and verb-noun pairs representing activities in the kitchen [7]
  • This suggests that the definition of the “right” label space for action recognition, and more generally for video understanding, is still very much up for debate.
  • It implies that finetuning models pretrained on large-scale labeled datasets is a suboptimal proxy for learning models for small- or medium-size datasets, due to the label-space gap often encountered between source and target datasets
Highlights
  • Do we need to explicitly name the actions of “laughing” or “sneezing” in order to recognize them? Or can we learn to visually classify them without labels by associating characteristic sounds with these actions? A wide body of perceptual studies provides evidence that humans rely heavily on hearing sounds to make sense of actions and dynamic events in the visual world
  • We demonstrate that self-supervised cross-modal learning with XDC on a large-scale video dataset yields an action recognition model that achieves higher accuracy when finetuned on HMDB51 or UCF101, compared to that produced by fully-supervised pretraining on Kinetics
  • To the best of our knowledge, XDC is the first method to demonstrate that self-supervision can outperform large-scale full-supervision in representation learning for action recognition. (III) The performance of the fully-supervised pretrained model is influenced more by the taxonomy of the pretraining data than by its size
  • To the best of our knowledge, XDC is the first method to demonstrate that self-supervision can outperform large-scale full-supervision in representation learning for action recognition. (II) XDC pretrained on IG65M sets new state-of-the-art performance for self-supervised methods on both datasets, as it outperforms the current state-of-the-art self-supervised method AVTS [29] by 5.8% on HMDB51 and 5.2% on UCF101. (III) When constrained to the same pretraining dataset (AudioSet), XDC outperforms AVTS by 2.2% on UCF101 and is only slightly worse than AVTS on HMDB51
  • We have presented Cross-Modal Deep Clustering (XDC), a novel self-supervised learning method for video and audio
  • Our experiments show that XDC significantly outperforms single-modality clustering and other multi-modal variants
  • To the best of our knowledge, XDC is the first method to demonstrate self-supervision outperforming large-scale full-supervision in representation learning for action recognition
Methods
  • Table 3 presents the results of XDC self-supervised pretraining with different data types and sizes, and compares them to fully-supervised pretraining on ImageNet, Kinetics, and AudioSet. Observations: (I) XDC performance improves across all three downstream tasks as the pretraining data size increases.
  • Supervised pretraining on Kinetics gives better performance on both UCF101 and HMDB51 than supervised pretraining on AudioSet and ImageNet. On the other hand, XDC performance is less sensitive to the data type, as it implicitly learns the label space rather than depending on a space manually defined by annotators (see the cluster-purity sketch after this list).
  • XDC achieves competitive results, with only a 1.7% gap separating it from the state-of-the-art [55] on ESC50 and a 1% gap with the result of [29] on DCASE
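
The claim that XDC “implicitly learns the label space” is quantified in the paper by measuring cluster purity against Kinetics labels (Tables 5, 6, 13, and 14). The snippet below is a minimal sketch of that kind of purity measurement, assuming hypothetical cluster assignments and ground-truth labels; it is not the authors' evaluation code.

```python
# Hedged sketch: purity of unsupervised clusters w.r.t. a labeled taxonomy
# (the kind of analysis behind Tables 5-6). Inputs are hypothetical arrays.
from collections import Counter

def cluster_purity(cluster_ids, true_labels):
    """For each cluster, the fraction of members sharing its most common label."""
    members = {}
    for cid, label in zip(cluster_ids, true_labels):
        members.setdefault(cid, []).append(label)
    per_cluster = {}
    for cid, labels in members.items():
        top_label, top_count = Counter(labels).most_common(1)[0]
        per_cluster[cid] = (top_label, top_count / len(labels))
    return per_cluster  # cluster id -> (dominant concept, purity)

# Example with made-up assignments and labels:
assignments = [0, 0, 0, 1, 1, 2, 2, 2, 2]
labels = ["laughing", "laughing", "sneezing", "bowling", "bowling",
          "playing bagpipes", "playing bagpipes", "playing bagpipes", "busking"]
for cid, (concept, purity) in sorted(cluster_purity(assignments, labels).items()):
    print(f"cluster {cid}: {concept} ({purity:.2f})")
```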
Results
  • Selected rows from the state-of-the-art comparison (Table 7); columns are method, backbone, pretraining data, and top-1 accuracy on UCF101 / HMDB51:
  • Supervised pretraining, R(2+1)D-18, ImageNet: 82.8 / 46.7.
  • Supervised pretraining, R(2+1)D-18, Kinetics: 93.1 / 63.6.
  • Misra et al. [38], CaffeNet, UCF/HMDB: 50.2 / 18.1.
  • Buchler et al. [3], CaffeNet, UCF/HMDB: 58.6 / 25.0.
  • OPN [34], VGG: …
  • MotionPred [68], C3D: …
Conclusion
  • The authors have presented Cross-Modal Deep Clustering (XDC), a novel self-supervised learning method for video and audio.
  • The authors' experiments showed that XDC outperforms existing self-supervised representation learning methods and fully-supervised ImageNet- and Kinetics-pretrained models in action recognition.
  • To the best of the authors' knowledge, XDC is the first method to demonstrate self-supervision outperforming large-scale full-supervision in representation learning for action recognition
Tables
  • Table1: Single-modality vs. multi-modal deep clustering. We compare the four self-supervised deep clustering models (Section 3) and the two baselines, Scratch and Supervised Pretraining (Superv). The pretraining dataset is Kinetics, and the performance reported is the top-1 accuracy on split-1 of each downstream task. We can observe that all multi-modal models significantly outperform the single-modality deep clustering model. We mark in bold the best model and underline the second-best for each dataset
  • Table2: The number of clusters (k). We show the effect of the number of k-means clusters on the performance of our XDC model. We show results when XDC is pretrained with selfsupervision on three large datasets, and then finetuned with full supervision on three medium-size downstream datasets. The performance reported is the top-1 accuracy on split-1 of each downstream task. We can observe that the best k value increases as the size of the pretraining dataset increases
  • Table3: Pretraining data type and size. We compare XDC pretrained on four datasets vs. supervised-pretrained (Superv) baseline models. The performance reported is the top-1 accuracy on split-1 of each downstream task. XDC performance improves as we increase the pretraining data size. XDC significantly outperforms fully-supervised pretraining on HMDB51 and UCF101
  • Table4: Full finetuning vs. learning fc-only. We compare XDC against the supervised-pretrained models (Superv) under two transfer-learning schemes: when models are used as a feature extractor (‘fc’ column) and as a finetuning initialization (‘all’ column) for the downstream tasks. XDC fixed features outperform several fully-finetuned supervised models (a minimal code sketch of the two schemes follows this list)
  • Table5: XDC audio clusters. Top and bottom XDC audio clusters ranked by clustering purity w.r.t. Kinetics labels. For each, we list the 3 concepts with the highest purity (given in parentheses)
  • Table6: XDC video clusters. Top and bottom XDC video clusters ranked by clustering purity w.r.t. Kinetics labels. For each, we list the 3 concepts with the highest purity (given in parentheses)
  • Table7: State-of-the-art on video action recognition. Comparison between XDC with self-supervised and fully-supervised methods on UCF101 and HMDB51 benchmarks. We report the average top-1 accuracy over the official splits. Not only does XDC set new state-of-the-art performance for self-supervised methods, it also outperforms fully-supervised Kinetics and ImageNet pretraining
  • Table8: State-of-the-art on audio event classification. We compare XDC with self-supervised methods on ESC50 and DCASE
  • Table9: Training parameter definitions. The abbreviations and descriptions of each training parameter
  • Table10: Pretraining parameters. We use early-stopping for
  • Table11: Finetuning parameters. Different pretraining methods have different ranges of optimal base learning rate when finetuning on downstream tasks. Thus, we cross-validate all methods with the same set of base learning rates and report the best result for each method. γ is set to 0.01 for all settings
  • Table12: Finetuning base learning rates. For a fair comparison, we cross-validate all pretraining methods with the same set of base learning rates. We report the best finetuning result for each method. Learning FC-only benefits from cross-validation with a wider range of base learning rates
  • Table13: XDC audio clusters. Top and bottom 10 XDC audio clusters ranked by clustering purity w.r.t. Kinetics labels. For each, we list the 5 concepts with the highest purity (given in parentheses)
  • Table14: XDC video clusters. Top and bottom 10 XDC video clusters ranked by clustering purity w.r.t. Kinetics labels. For each, we list the 5 concepts with the highest purity (given in parentheses)
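
Table 4 contrasts two standard transfer-learning schemes. The sketch below illustrates the distinction in PyTorch, using torchvision's R(2+1)D-18 as a hypothetical stand-in for the paper's pretrained backbone; the weight loading, learning rates, and schedules are placeholders rather than the exact settings in Tables 9-12.

```python
# Hedged sketch of the two transfer schemes in Table 4: frozen features ('fc')
# vs. full finetuning ('all'). The torchvision R(2+1)D-18 here is a stand-in
# for the paper's pretrained backbone; loading the actual XDC weights is assumed.
import torch.nn as nn
import torch.optim as optim
from torchvision.models.video import r2plus1d_18

NUM_CLASSES = 101  # e.g., UCF101

def build_finetune_model(fc_only: bool, base_lr: float):
    model = r2plus1d_18()  # pretrained XDC weights would be loaded here (assumption)
    model.fc = nn.Linear(model.fc.in_features, NUM_CLASSES)  # new task head
    if fc_only:
        # 'fc' scheme: the backbone acts as a fixed feature extractor.
        for name, p in model.named_parameters():
            p.requires_grad = name.startswith("fc.")
        params = model.fc.parameters()
    else:
        # 'all' scheme: every layer is updated, typically with a smaller base lr.
        params = model.parameters()
    optimizer = optim.SGD(params, lr=base_lr, momentum=0.9, weight_decay=1e-4)
    return model, optimizer

# Illustrative learning rates; the paper cross-validates both schemes over a
# shared set of base learning rates (Table 12) and reports the best per method.
fc_model, fc_opt = build_finetune_model(fc_only=True, base_lr=0.01)
all_model, all_opt = build_finetune_model(fc_only=False, base_lr=0.001)
```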
Related work
  • Early unsupervised representation learning. Pioneering work in unsupervised learning of video models includes deep belief networks [21], autoencoders [22, 63], shift-invariant decoders [52], sparse coding algorithms [33], and stacked ISAs [32]. While these approaches learn by reconstructing the input, our approach focuses on learning from a self-supervised pretext task, in which we construct pseudo-labels for supervised learning from the unlabeled data.
  • Self-supervised representation learning from images. One line of pretext tasks exploits spatial context in images, e.g. by predicting the relative position of patches [8] or solving jigsaw puzzles [41]. Another line of work involves creating image classification pseudo-labels, e.g. through artificial rotations [13] or clustering features [5]. Other self-supervised tasks include colorization [74], inpainting [47], motion segmentation [46], and instance counting [42].
  • Self-supervised representation learning from video. In recent years, several approaches have extended image pretext tasks to video [28, 68, 73]. Other pretext tasks for video modeling include temporal frame ordering [9, 34, 38, 72], establishing region or object correspondences across frames [23, 24, 69, 70], predicting flow [31] or colors [66], as well as tracking [71]. Moreover, a large set of self-supervised approaches on video are built upon the pretext task of future frame prediction [36, 37, 59, 64, 65]. Unlike this prior work, our self-supervised model makes use of two modalities, video and audio.
  • Cross-modal learning and distillation. Inspired by the human multi-modal sensory system [6, 40], researchers have explored learning using multiple modalities, e.g. video and audio. Here, we review the relevant work in this area and contrast these methods with our cross-modal deep clustering approach. Several approaches [2, 17] train an encoder with full supervision on one modality (e.g. RGB) and then use the pretrained encoder to transfer the discriminative knowledge to an encoder of a different modality (e.g. depth). Unlike these methods, our approach is purely self-supervised and does not require pretraining an encoder with full supervision. Other approaches learn from unlabeled multi-modal data for a specific target task, e.g. sound source localization [75] and audio-visual co-segmentation [54]. Instead, our method aims at learning general visual and audio representations that transfer well to a wide range of downstream tasks. Previous cross-modal self-supervised representation learning methods most relevant to our work include audio-visual correspondence [1], audio-visual temporal synchronization [29, 44], and learning image representations using ambient sound [45]. While audio-visual correspondence [1, 45] uses only a single frame of a video, our method takes a video clip as input. Temporal synchronization [29, 44] requires constructing positive/negative examples corresponding to in-sync and out-of-sync video-audio pairs. This sampling strategy makes these approaches more difficult to scale compared to ours, as many potential out-of-sync pairs can be generated, yielding largely different results depending on the sampling choice [29].
Reference
  • Relja Arandjelovic and Andrew Zisserman. Look, listen and learn. In ICCV, 2017.
  • Yusuf Aytar, Carl Vondrick, and Antonio Torralba. SoundNet: Learning sound representations from unlabeled video. In NeurIPS, 2016.
  • Uta Buchler, Biagio Brattoli, and Bjorn Ommer. Improving spatiotemporal self-supervision by deep reinforcement learning. In ECCV, 2018.
  • Fabian Caba Heilbron, Victor Escorcia, Bernard Ghanem, and Juan Carlos Niebles. ActivityNet: A large-scale video benchmark for human activity understanding. In CVPR, 2015.
  • Mathilde Caron, Piotr Bojanowski, Armand Joulin, and Matthijs Douze. Deep clustering for unsupervised learning of visual features. In ECCV, 2018.
  • Guillem Collell Talleda and Marie-Francine Moens. Is an image worth more than a thousand words? On the fine-grain semantic differences between visual and linguistic representations. In COLING, 2012.
  • Dima Damen, Hazel Doughty, Giovanni Maria Farinella, Sanja Fidler, Antonino Furnari, Evangelos Kazakos, Davide Moltisanti, Jonathan Munro, Toby Perrett, Will Price, and Michael Wray. Scaling egocentric vision: The EPIC-KITCHENS dataset. In ECCV, 2018.
  • Carl Doersch, Abhinav Gupta, and Alexei A. Efros. Unsupervised visual representation learning by context prediction. In ICCV, 2015.
  • Basura Fernando, Hakan Bilen, Efstratios Gavves, and Stephen Gould. Self-supervised video representation learning with odd-one-out networks. In CVPR, 2017.
  • Jort F. Gemmeke, Daniel P. W. Ellis, Dylan Freedman, Aren Jansen, Wade Lawrence, R. Channing Moore, Manoj Plakal, and Marvin Ritter. Audio Set: An ontology and human-labeled dataset for audio events. In ICASSP, 2017.
  • A. Gentile and S. DiFrancesca. Academic achievement test performance of hearing-impaired students, United States, Spring 1969 (Series D, No. 1). Washington, DC: Gallaudet University, Center for Assessment and Demographic Studies, 1969.
  • Deepti Ghadiyaram, Matt Feiszli, Du Tran, Xueting Yan, Heng Wang, and Dhruv Mahajan. Large-scale weakly-supervised pre-training for video action recognition. In CVPR, 2019.
  • Spyros Gidaris, Praveer Singh, and Nikos Komodakis. Unsupervised representation learning by predicting image rotations. In ICLR, 2018.
  • Priya Goyal, Piotr Dollar, Ross Girshick, Pieter Noordhuis, Lukasz Wesolowski, Aapo Kyrola, Andrew Tulloch, Yangqing Jia, and Kaiming He. Accurate, large minibatch SGD: Training ImageNet in 1 hour. arXiv preprint arXiv:1706.02677, 2017.
  • Priya Goyal, Dhruv Mahajan, Abhinav Gupta, and Ishan Misra. Scaling and benchmarking self-supervised visual representation learning. In ICCV, 2019.
  • Raghav Goyal, Samira Ebrahimi Kahou, Vincent Michalski, Joanna Materzynska, Susanne Westphal, Heuna Kim, Valentin Haenel, Ingo Fruend, Peter Yianilos, Moritz Mueller-Freitag, et al. The ”something something” video database for learning and evaluating visual common sense. In ICCV, 2017.
  • Saurabh Gupta, Judy Hoffman, and Jitendra Malik. Cross modal distillation for supervision transfer. In CVPR, 2016.
  • Tengda Han, Weidi Xie, and Andrew Zisserman. Video representation learning by dense predictive coding. arXiv preprint arXiv:1909.04656, 2019.
  • Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In CVPR, 2016.
  • Rickye S. Heffner and Henry E. Heffner. Evolution of Sound Localization in Mammals, pages 691–715. Springer, New York, NY, 1992.
  • Geoffrey E. Hinton, Simon Osindero, and Yee-Whye Teh. A fast learning algorithm for deep belief nets. Neural Computation, 2006.
  • Geoffrey E. Hinton and Ruslan R. Salakhutdinov. Reducing the dimensionality of data with neural networks. Science, 2006.
  • Phillip Isola, Daniel Zoran, Dilip Krishnan, and Edward H. Adelson. Learning visual groups from co-occurrences in space and time. In ICLR, 2015.
  • Dinesh Jayaraman and Kristen Grauman. Slow and steady feature analysis: Higher order temporal coherence in video. In CVPR, 2016.
  • Longlong Jing and Yingli Tian. Self-supervised spatiotemporal feature learning by video geometric transformations. arXiv preprint arXiv:1811.11387, 2018.
  • Andrej Karpathy, George Toderici, Sanketh Shetty, Thomas Leung, Rahul Sukthankar, and Li Fei-Fei. Large-scale video classification with convolutional neural networks. In CVPR, 2014.
  • Will Kay, Joao Carreira, Karen Simonyan, Brian Zhang, Chloe Hillier, Sudheendra Vijayanarasimhan, Fabio Viola, Tim Green, Trevor Back, Paul Natsev, Mustafa Suleyman, and Andrew Zisserman. The Kinetics human action video dataset. CoRR, abs/1705.06950, 2017.
  • Dahun Kim, Donghyeon Cho, and In So Kweon. Self-supervised video representation learning with space-time cubic puzzles. In AAAI, 2019.
  • Bruno Korbar, Du Tran, and Lorenzo Torresani. Cooperative learning of audio and video models from self-supervised synchronization. In NeurIPS, 2018.
  • H. Kuehne, H. Jhuang, E. Garrote, T. Poggio, and T. Serre. HMDB: A large video database for human motion recognition. In ICCV, 2011.
  • Zihang Lai and Weidi Xie. Self-supervised learning for video correspondence flow. In BMVC, 2019.
  • Quoc V. Le, Will Y. Zou, Serena Y. Yeung, and Andrew Y. Ng. Learning hierarchical invariant spatio-temporal features for action recognition with independent subspace analysis. In CVPR, 2011.
  • Honglak Lee, Alexis Battle, Rajat Raina, and Andrew Y. Ng. Efficient sparse coding algorithms. In NeurIPS, 2007.
  • Hsin-Ying Lee, Jia-Bin Huang, Maneesh Singh, and Ming-Hsuan Yang. Unsupervised representation learning by sorting sequences. In ICCV, 2017.
  • David Li, Jason Tam, and Derek Toub. Auditory scene classification using machine learning techniques. AASP Challenge, 2013.
  • William Lotter, Gabriel Kreiman, and David Cox. Deep predictive coding networks for video prediction and unsupervised learning. In ICLR, 2017.
  • Michael Mathieu, Camille Couprie, and Yann LeCun. Deep multi-scale video prediction beyond mean square error. In ICLR, 2016.
  • Ishan Misra, C. Lawrence Zitnick, and Martial Hebert. Shuffle and learn: Unsupervised learning using temporal order verification. In ECCV, 2016.
  • Helmer R. Myklebust. The Psychology of Deafness: Sensory Deprivation, Learning, and Adjustment. 1960.
  • Jiquan Ngiam, Aditya Khosla, Mingyu Kim, Juhan Nam, Honglak Lee, and Andrew Y. Ng. Multimodal deep learning. In ICML, 2011.
  • Mehdi Noroozi and Paolo Favaro. Unsupervised learning of visual representations by solving jigsaw puzzles. In ECCV, 2016.
  • Mehdi Noroozi, Hamed Pirsiavash, and Paolo Favaro. Representation learning by learning to count. In ICCV, 2017.
  • Risto Näätänen. Attention and Brain Function. 1992.
  • Andrew Owens and Alexei A. Efros. Audio-visual scene analysis with self-supervised multisensory features. In ECCV, 2018.
  • Andrew Owens, Jiajun Wu, Josh H. McDermott, William T. Freeman, and Antonio Torralba. Ambient sound provides supervision for visual learning. In ECCV, 2016.
  • Deepak Pathak, Ross Girshick, Piotr Dollar, Trevor Darrell, and Bharath Hariharan. Learning features by watching objects move. In CVPR, 2017.
  • Deepak Pathak, Philipp Krahenbuhl, Jeff Donahue, Trevor Darrell, and Alexei A. Efros. Context encoders: Feature learning by inpainting. In CVPR, 2016.
  • Karol J. Piczak. Environmental sound classification with convolutional neural networks. In MLSP, 2015.
  • Karol J. Piczak. ESC: Dataset for environmental sound classification. In ACM Multimedia, 2015.
  • Jordi Pons and Xavier Serra. Randomly weighted CNNs for (music) audio classification. In ICASSP, 2019.
  • Alain Rakotomamonjy and Gilles Gasso. Histogram of gradients of time-frequency representations for audio scene classification. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 2015.
  • Marc'Aurelio Ranzato, Fu Jie Huang, Y-Lan Boureau, and Yann LeCun. Unsupervised learning of invariant feature hierarchies with applications to object recognition. In CVPR, 2007.
  • Guido Roma, Waldo Nogueira, and Perfecto Herrera. Recurrence quantification analysis features for environmental sound recognition. In WASPAA, 2013.
  • Andrew Rouditchenko, Hang Zhao, Chuang Gan, Josh McDermott, and Antonio Torralba. Self-supervised audio-visual co-segmentation. In ICASSP, 2019.
  • Hardik B. Sailor, Dharmesh M. Agrawal, and Hemant A. Patil. Unsupervised filterbank learning using convolutional restricted Boltzmann machine for environmental sound classification. In INTERSPEECH, 2017.
  • Andrew M. Saxe, Pang Wei Koh, Zhenghao Chen, Maneesh Bhand, Bipin Suresh, and Andrew Y. Ng. On random weights and unsupervised feature learning. In ICML, 2011.
  • Ladan Shams and Robyn Kim. Crossmodal influences on visual perception. Physics of Life Reviews, 7(3):269–284, 2010.
  • Khurram Soomro, Amir Roshan Zamir, and Mubarak Shah. UCF101: A dataset of 101 human action classes from videos in the wild. CRCV-TR-12-01, 2012.
  • Nitish Srivastava, Elman Mansimov, and Ruslan Salakhutdinov. Unsupervised learning of video representations using LSTMs. In ICML, 2015.
  • D. Stowell, D. Giannoulis, E. Benetos, M. Lagrange, and M. D. Plumbley. Detection and classification of acoustic scenes and events. IEEE Transactions on Multimedia, 17(10):1733–1746, October 2015.
  • Dan Stowell, Dimitrios Giannoulis, Emmanouil Benetos, Mathieu Lagrange, and Mark D. Plumbley. Detection and classification of acoustic scenes and events. IEEE Transactions on Multimedia, 2015.
  • Du Tran, Heng Wang, Lorenzo Torresani, Jamie Ray, Yann LeCun, and Manohar Paluri. A closer look at spatiotemporal convolutions for action recognition. In CVPR, 2018.
  • Pascal Vincent, Hugo Larochelle, Yoshua Bengio, and Pierre-Antoine Manzagol. Extracting and composing robust features with denoising autoencoders. In ICML, 2008.
  • Carl Vondrick, Hamed Pirsiavash, and Antonio Torralba. Anticipating visual representations from unlabeled video. In CVPR, 2016.
  • Carl Vondrick, Hamed Pirsiavash, and Antonio Torralba. Generating videos with scene dynamics. In NeurIPS, 2016.
  • Carl Vondrick, Abhinav Shrivastava, Alireza Fathi, Sergio Guadarrama, and Kevin Murphy. Tracking emerges by colorizing videos. In ECCV, 2018.
  • Heng Wang and Cordelia Schmid. Action recognition with improved trajectories. In ICCV, 2013.
  • Jiangliu Wang, Jianbo Jiao, Linchao Bao, Shengfeng He, Yunhui Liu, and Wei Liu. Self-supervised spatio-temporal representation learning for videos by predicting motion and appearance statistics. In CVPR, 2019.
  • Xiaolong Wang and Abhinav Gupta. Unsupervised learning of visual representations using videos. In ICCV, 2015.
  • Xiaolong Wang, Allan Jabri, and Alexei A. Efros. Learning correspondence from the cycle-consistency of time. In CVPR, 2019.
  • Donglai Wei, Joseph J. Lim, Andrew Zisserman, and William T. Freeman. Learning and using the arrow of time. In CVPR, 2018.
  • Dejing Xu, Jun Xiao, Zhou Zhao, Jian Shao, Di Xie, and Yueting Zhuang. Self-supervised spatiotemporal learning via video clip order prediction. In CVPR, 2019.
  • Richard Zhang, Phillip Isola, and Alexei A. Efros. Colorful image colorization. In ECCV, 2016.
  • Hang Zhao, Chuang Gan, Andrew Rouditchenko, Carl Vondrick, Josh McDermott, and Antonio Torralba. The sound of pixels. In ECCV, 2018.