Cooperative Learning of Audio and Video Models from Self-Supervised Synchronization

Bruno Korbar, Du Tran, Lorenzo Torresani

Advances in Neural Information Processing Systems (NeurIPS), pp. 7774-7785, 2018.

Keywords:
cooperative learning, video models, audio and video, curriculum learning

Abstract:

There is a natural correlation between the visual and auditive elements of a video. In this work we leverage this connection to learn audio and video features from self-supervised temporal synchronization. We demonstrate that a calibrated curriculum learning scheme, a careful choice of negative examples, and the use of a contrastive loss are key ingredients for learning effective multi-sensory representations from audio-visual temporal synchronization.
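
The contrastive objective mentioned in the abstract can be illustrated with a short sketch. The snippet below is not the authors' code: it assumes a standard margin-based contrastive loss computed on (hypothetical) video and audio embeddings, with `margin` as a free hyperparameter and a binary label marking whether the audio and video clips are in sync.

```python
import torch
import torch.nn.functional as F

def avts_contrastive_loss(video_emb, audio_emb, in_sync, margin=1.0):
    """Margin-based contrastive loss over audio-video pairs (illustrative sketch).

    video_emb, audio_emb: (batch, dim) embeddings from the two subnetworks.
    in_sync: (batch,) tensor equal to 1.0 for synchronized (positive) pairs
             and 0.0 for out-of-sync or mismatched (negative) pairs.
    margin:  hyperparameter pushing negative pairs at least this far apart.
    """
    dist = F.pairwise_distance(video_emb, audio_emb)      # Euclidean distance per pair
    pos = in_sync * dist.pow(2)                           # pull positives together
    neg = (1.0 - in_sync) * F.relu(margin - dist).pow(2)  # push negatives beyond the margin
    return (pos + neg).mean()
```

A pair is labeled positive when the audio and video clips come from the same video and the same temporal window, and negative otherwise.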

Introduction
  • Image recognition has undergone dramatic progress since the breakthrough of AlexNet [1] and the widespread availability of progressively large datasets such as Imagenet [2].
  • The growth in scale has enabled more effective end-to-end training of deep models and the finer-grained definition of classes has made possible the learning of more discriminative features.
  • This has inspired a new generation of deep video models [17, 18, 19] greatly advancing the field.
  • One may argue that future significant improvements by mere dataset growth will require scaling up existing benchmarks by one or more orders of magnitude, which may not be possible in the short term
Highlights
  • Image recognition has undergone dramatic progress since the breakthrough of AlexNet [1] and the widespread availability of progressively large datasets such as Imagenet [2]
  • We assess the ability of the Audio-Visual Temporal Synchronization (AVTS) procedure to serve as an effective pretraining mechanism for video-based action recognition models
  • We use Kinetics for AVTS learning, as this allows us to compare the results of our self-supervised pretraining with those of fully-supervised pretraining based on the action class labels available in Kinetics
  • We report action recognition results using the I3D-RGB [18] network trained in several ways: learned from scratch, pretrained using our self-supervised AVTS, or pretrained with category labels
  • In this work we have shown that the self-supervised mechanism of audio-visual temporal synchronization (AVTS) can be used to learn general and effective models for both the audio and the vision domain
  • We have shown that curriculum learning significantly improves the quality of the features on all end tasks (see the negative-sampling sketch after this list)
  • While in this work we trained AVTS on established, labeled video datasets in order to allow a direct comparison with fully-supervised pretraining methods, our approach is self-supervised and does not require any manual labeling
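
As a rough illustration of the curriculum referenced in the highlights above (and in Table 1 below), the sketch draws only "easy" negatives (audio taken from a different video) in a first training stage and mixes in "hard" negatives (audio from the same video but shifted in time) in a second stage. The dataset interface, the 50/50 positive rate, the hard-negative fraction, and the shift amount are assumptions made for illustration, not the authors' implementation.

```python
import random

def sample_pair(dataset, stage, hard_fraction=0.25, shift_seconds=2.0):
    """Draw one (video_clip, audio_clip, label) training pair (illustrative sketch).

    dataset is assumed to expose random_video(); each video is assumed to expose
    duration plus video_clip(t) and audio_clip(t) returning fixed-length clips at time t.
    stage: 1 = easy negatives only, 2 = mix of easy and hard negatives.
    """
    video = dataset.random_video()
    t = random.uniform(0.0, video.duration)

    if random.random() < 0.5:                       # positive: in-sync audio-video pair
        return video.video_clip(t), video.audio_clip(t), 1.0

    if stage >= 2 and random.random() < hard_fraction:
        t_shifted = (t + shift_seconds) % video.duration   # hard negative: same video, shifted audio
        return video.video_clip(t), video.audio_clip(t_shifted), 0.0

    other = dataset.random_video()                  # easy negative: audio from another video
    return video.video_clip(t), other.audio_clip(random.uniform(0.0, other.duration)), 0.0
```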
Methods
  • The MCx network architecture uses 3D convolutions in the early layers, followed by 2D convolutions in the subsequent layers (a toy version is sketched after this list).
  • The intuition is that temporal modeling by 3D convolutions is useful in the early layers, while the late layers responsible for the final prediction do not require temporal processing.
  • MCx models were shown to provide a good trade-off in terms of video classification accuracy, number of learnable parameters, and runtime efficiency.
  • Further implementation details can be found in [26]
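
The mixed-convolution idea described in the Methods bullets can be made concrete with a toy model: the first x blocks use full 3D kernels, while the remaining blocks use kernels with a temporal extent of 1 (effectively 2D). Channel widths, the number of blocks, and the pooling head below are placeholders, not the configuration used in the paper.

```python
import torch
import torch.nn as nn

class MCxSketch(nn.Module):
    """Toy mixed-convolution network: 3D convolutions in the first `x` blocks,
    frame-wise (1 x k x k) convolutions afterwards. Illustrative only."""

    def __init__(self, x=3, num_classes=400):
        super().__init__()
        channels = [3, 64, 64, 128, 256, 512]
        blocks = []
        for i in range(5):
            if i < x:
                # Full 3D convolution: models motion across neighbouring frames.
                conv = nn.Conv3d(channels[i], channels[i + 1],
                                 kernel_size=(3, 3, 3), padding=(1, 1, 1))
            else:
                # Temporal extent 1: a 2D convolution applied to every frame.
                conv = nn.Conv3d(channels[i], channels[i + 1],
                                 kernel_size=(1, 3, 3), padding=(0, 1, 1))
            blocks += [conv, nn.BatchNorm3d(channels[i + 1]), nn.ReLU(inplace=True)]
        self.features = nn.Sequential(*blocks)
        self.head = nn.Linear(channels[-1], num_classes)

    def forward(self, clip):                 # clip: (batch, 3, frames, height, width)
        feats = self.features(clip)
        feats = feats.mean(dim=[2, 3, 4])    # global average pool over time and space
        return self.head(feats)
```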
Results
  • Evaluation on Audio-Visual Temporal Synchronization (AVTS): The authors first evaluate the approach on the AVTS task itself.
  • The authors then evaluate the audio features learned by the AVTS procedure through minimization of the contrastive loss
  • For this purpose, the authors take the activations from the conv_5 layer of the audio subnetwork and test their quality as audio representation on two established sound classification datasets: ESC-50 [29] and DCASE2014 [30].
  • The authors compute the classification score for each audio sample by averaging the sub-clip scores within the sample, and predict the class with the highest score (a sketch of this aggregation follows below)
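
The sub-clip aggregation described in the last bullet amounts to averaging class scores. In the hedged sketch below, `model` stands in for the frozen conv_5 audio features followed by a classifier trained on the target benchmark, and `sub_clips` for the list of fixed-length audio sub-clips extracted from one sample; both names are placeholders.

```python
import torch

def classify_audio_sample(model, sub_clips):
    """Average per-sub-clip class scores and return the top-scoring class index.

    model: callable mapping one sub-clip tensor to a 1-D tensor of class scores.
    sub_clips: list of sub-clip tensors extracted from a single audio sample.
    """
    with torch.no_grad():
        scores = torch.stack([model(clip) for clip in sub_clips])  # (n_sub_clips, n_classes)
    return scores.mean(dim=0).argmax().item()
```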
Conclusion
  • In this work the authors have shown that the self-supervised mechanism of audio-visual temporal synchronization (AVTS) can be used to learn general and effective models for both the audio and the vision domain.
  • This opens up the possibility of self-supervised pretraining on video collections that are much larger than any existing labeled video dataset and that may be derived from many different sources (YouTube, Flickr videos, Facebook posts, TV news, movies, etc.)
  • The authors believe that this may yield further improvements in the generality and effectiveness of the models for downstream tasks in the audio and video domains, and that it may help bridge the remaining gap with respect to fully-supervised pretraining methods that rely on costly manual annotations
Tables
  • Table 1: AVTS accuracy achieved by our system on the Kinetics test set, which includes negatives of only “easy” type, as in [21]. The table shows that curriculum learning with a mix of easy and hard negatives in a second stage of training leads to a significant gain in accuracy
  • Table 2: Action recognition accuracy (%) on UCF101 [25] and HMDB51 [24] using AVTS as a self-supervised pretraining mechanism. Even though our pretraining does not leverage any manual labels, it yields a remarkable gain in accuracy compared to learning from scratch (+19.9% on UCF101 and +17.7% on HMDB51, for MC3). As expected, making use of Kinetics action labels for pretraining yields a further boost. But the accuracy gaps are not too large (only +1.5% on UCF101 with MC3) and may potentially be bridged by making use of a larger pretraining dataset, since no manual cost is involved in our procedure. Additionally, we show that our method generalizes to different families of models, such as I3D-RGB [18]. Rows marked with * report numbers as listed in the I3D-RGB paper, which may have used a slightly different setup in terms of data preprocessing and evaluation
  • Table 3: Evaluation of audio features learned with AVTS on two audio classification benchmarks: ESC-50 and DCASE2014. "Our audio subnet" denotes our audio subnetwork trained directly on these benchmarks. The superior performance of our AVTS features suggests that the effectiveness of our approach lies in the self-supervised learning procedure rather than in the network architecture
  • Table 4: Impact of curriculum learning on AVTS and downstream tasks (audio classification and action recognition). Both L3-Net [21] and our AVTS model are pretrained, fine-tuned (when applicable)
Related work
  • Unsupervised learning has been studied for decades in both computer vision and machine learning. Inspirational work in this area includes deep belief networks [33], stacked autoencoders [34], shift-invariant decoders [35], sparse coding [36], TICA [37], and stacked ISAs [38]. Instead of reconstructing the original inputs, as typically done in unsupervised learning, self-supervised learning methods try to exploit free supervision from images or videos. Wang et al. [39] used tracklets of image patches across video frames as self-supervision. Doersch et al. [41] exploited the spatial context of image patches to pre-train a deep ConvNet. Fernando et al. [42] used temporal context for self-supervised pre-training, while Misra et al. [43] proposed frame-shuffling as a self-supervised task.
Funding
  • This work was funded in part by NSF award CNS-120552
Reference
  • Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. Imagenet classification with deep convolutional neural networks. In Advances in neural information processing systems, pages 1097–1105, 2012.
  • Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. Imagenet: A large-scale hierarchical image database. In Computer Vision and Pattern Recognition, 2009. CVPR 2009. IEEE Conference on, pages 248–255. IEEE, 2009.
  • Shaoqing Ren, Kaiming He, Ross B. Girshick, and Jian Sun. Faster R-CNN: towards real-time object detection with region proposal networks. In Advances in Neural Information Processing Systems 28: Annual Conference on Neural Information Processing Systems 2015, December 7-12, 2015, Montreal, Quebec, Canada, pages 91–99, 2015.
  • Kaiming He, Georgia Gkioxari, Piotr Dollár, and Ross B. Girshick. Mask R-CNN. In IEEE International Conference on Computer Vision, ICCV 2017, Venice, Italy, October 22-29, 2017, pages 2980–2988, 2017.
  • Zhe Cao, Tomas Simon, Shih-En Wei, and Yaser Sheikh. Realtime multi-person 2d pose estimation using part affinity fields. In CVPR, 2017.
  • Shih-En Wei, Varun Ramakrishna, Takeo Kanade, and Yaser Sheikh. Convolutional pose machines. In 2016 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2016, Las Vegas, NV, USA, June 27-30, 2016, pages 4724–4732, 2016.
  • Jonathan Long, Evan Shelhamer, and Trevor Darrell. Fully convolutional networks for semantic segmentation. In IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2015, Boston, MA, USA, June 7-12, 2015, pages 3431–3440, 2015.
  • Fisher Yu and Vladlen Koltun. Multi-scale context aggregation by dilated convolutions. CoRR, abs/1511.07122, 2015.
  • Andrej Karpathy, George Toderici, Sanketh Shetty, Thomas Leung, Rahul Sukthankar, and Fei-Fei Li. Large-scale video classification with convolutional neural networks. In 2014 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2014, Columbus, OH, USA, June 23-28, 2014, pages 1725–1732, 2014.
  • Du Tran, Lubomir D. Bourdev, Rob Fergus, Lorenzo Torresani, and Manohar Paluri. Learning spatiotemporal features with 3d convolutional networks. In 2015 IEEE International Conference on Computer Vision, ICCV 2015, Santiago, Chile, December 7-13, 2015, pages 4489–4497, 2015.
  • Heng Wang and Cordelia Schmid. Action recognition with improved trajectories. In IEEE International Conference on Computer Vision, ICCV 2013, Sydney, Australia, December 1-8, 2013, pages 3551–3558, 2013.
  • Will Kay, Joao Carreira, Karen Simonyan, Brian Zhang, Chloe Hillier, Sudheendra Vijayanarasimhan, Fabio Viola, Tim Green, Trevor Back, Paul Natsev, Mustafa Suleyman, and Andrew Zisserman. The kinetics human action video dataset. CoRR, abs/1705.06950, 2017.
  • Chunhui Gu, Chen Sun, Sudheendra Vijayanarasimhan, Caroline Pantofaru, David A. Ross, George Toderici, Yeqing Li, Susanna Ricco, Rahul Sukthankar, Cordelia Schmid, and Jitendra Malik. AVA: A video dataset of spatio-temporally localized atomic visual actions. CoRR, abs/1705.08421, 2017.
  • Fabian Caba Heilbron, Victor Escorcia, Bernard Ghanem, and Juan Carlos Niebles. Activitynet: A large-scale video benchmark for human activity understanding. In IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2015, Boston, MA, USA, June 7-12, 2015, pages 961–970, 2015.
  • Hang Zhao, Zhicheng Yan, Heng Wang, Lorenzo Torresani, and Antonio Torralba. SLAC: A sparsely labeled dataset for action classification and localization. CoRR, abs/1712.09374, 2017.
  • Gunnar A. Sigurdsson, Gül Varol, Xiaolong Wang, Ali Farhadi, Ivan Laptev, and Abhinav Gupta. Hollywood in homes: Crowdsourcing data collection for activity understanding. In Computer Vision - ECCV 2016 - 14th European Conference, Amsterdam, The Netherlands, October 11-14, 2016, Proceedings, Part I, pages 510–526, 2016.
  • Christoph Feichtenhofer, Axel Pinz, and Richard Wildes. Spatiotemporal residual networks for video action recognition. In Advances in neural information processing systems, pages 3468–3476, 2016.
  • João Carreira and Andrew Zisserman. Quo vadis, action recognition? A new model and the kinetics dataset. In 2017 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2017, Honolulu, HI, USA, July 21-26, 2017, pages 4724–4733, 2017.
  • Du Tran, Heng Wang, Lorenzo Torresani, Jamie Ray, Yann LeCun, and Manohar Paluri. A closer look at spatiotemporal convolutions for action recognition. In CVPR, 2018.
  • Yusuf Aytar, Carl Vondrick, and Antonio Torralba. Soundnet: Learning sound representations from unlabeled video. In Advances in Neural Information Processing Systems 29: Annual Conference on Neural Information Processing Systems 2016, December 5-10, 2016, Barcelona, Spain, pages 892–900, 2016.
  • Relja Arandjelovic and Andrew Zisserman. Look, listen and learn. In IEEE International Conference on Computer Vision, ICCV 2017, Venice, Italy, October 22-29, 2017, pages 609–617, 2017.
  • Relja Arandjelovic and Andrew Zisserman. Objects that sound. In Computer Vision - ECCV 2018 - 15th European Conference, Munich, Germany, September 8-14, 2018, Proceedings, Part I, pages 451–466, 2018.
  • Yoshua Bengio, Jérôme Louradour, Ronan Collobert, and Jason Weston. Curriculum learning. In Proceedings of the 26th Annual International Conference on Machine Learning, ICML ’09, pages 41–48, New York, NY, USA, 2009. ACM.
  • Hilde Kuehne, Hueihan Jhuang, Rainer Stiefelhagen, and Thomas Serre. Hmdb51: A large video database for human motion recognition. In High Performance Computing in Science and Engineering ‘12, pages 571–582.
  • Khurram Soomro, Amir Roshan Zamir, and Mubarak Shah. Ucf101: A dataset of 101 human actions classes from videos in the wild. arXiv preprint arXiv:1212.0402, 2012.
  • Joon Son Chung and Andrew Zisserman. Out of time: Automated lip sync in the wild. In Computer Vision - ACCV 2016 Workshops - ACCV 2016 International Workshops, Taipei, Taiwan, November 20-24, 2016, Revised Selected Papers, Part II, pages 251–263, 2016.
  • Gregory Koch, Richard Zemel, and Ruslan Salakhutdinov. Siamese neural networks for one-shot image recognition. In ICML Deep Learning Workshop, volume 2, 2015.
  • Jort F. Gemmeke, Daniel P. W. Ellis, Dylan Freedman, Aren Jansen, Wade Lawrence, R. Channing Moore, Manoj Plakal, and Marvin Ritter. Audio set: An ontology and human-labeled dataset for audio events. In Proc. IEEE ICASSP 2017, New Orleans, LA, 2017.
  • Karol J. Piczak. ESC: Dataset for Environmental Sound Classification. In Proceedings of the 23rd Annual ACM Conference on Multimedia, pages 1015–1018. ACM Press, 2015.
  • Dan Stowell, Dimitrios Giannoulis, Emmanouil Benetos, Mathieu Lagrange, and Mark D Plumbley. Detection and classification of acoustic scenes and events. IEEE Transactions on Multimedia, 17(10):1733–1746, 2015.
  • Hardik B Sailor, Dharmesh M Agrawal, and Hemant A Patil. Unsupervised filterbank learning using convolutional restricted boltzmann machine for environmental sound classification. Proc. Interspeech 2017, pages 3107–3111, 2017.
  • Andrew Owens and Alexei A. Efros. Audio-visual scene analysis with self-supervised multisensory features. In Computer Vision - ECCV 2018 - 15th European Conference, Munich, Germany, September 8-14, 2018, Proceedings, Part VI, pages 639–658, 2018.
  • Geoffrey E. Hinton, Simon Osindero, and Yee Whye Teh. A fast learning algorithm for deep belief nets. Neural Computation, 18(7):1527–1554, 2006.
  • Yoshua Bengio, Pascal Lamblin, Dan Popovici, and Hugo Larochelle. Greedy layer-wise training of deep networks. In Advances in Neural Information Processing Systems 19, Proceedings of the Twentieth Annual Conference on Neural Information Processing Systems, Vancouver, British Columbia, Canada, December 4-7, 2006, pages 153–160, 2006.
  • Marc’Aurelio Ranzato, Fu Jie Huang, Y-Lan Boureau, and Yann LeCun. Unsupervised learning of invariant feature hierarchies with applications to object recognition. In 2007 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR 2007), 18-23 June 2007, Minneapolis, Minnesota, USA, 2007.
  • Honglak Lee, Alexis Battle, Rajat Raina, and Andrew Y. Ng. Efficient sparse coding algorithms. In Advances in Neural Information Processing Systems 19, Proceedings of the Twentieth Annual Conference on Neural Information Processing Systems, Vancouver, British Columbia, Canada, December 4-7, 2006, pages 801–808, 2006.
  • Quoc Le, Marc’Aurelio Ranzato, Rajat Monga, Matthieu Devin, Kai Chen, Greg Corrado, Jeff Dean, and Andrew Ng. Building high-level features using large scale unsupervised learning. In ICML, 2011.
  • Quoc V. Le, Will Y. Zou, Serena Y. Yeung, and Andrew Y. Ng. Learning hierarchical invariant spatio-temporal features for action recognition with independent subspace analysis. In The 24th IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2011, Colorado Springs, CO, USA, 20-25 June 2011, pages 3361–3368, 2011.
  • Xiaolong Wang and Abhinav Gupta. Unsupervised learning of visual representations using videos. In ICCV, 2015.
  • H. Izadinia, I. Saleemi, and M. Shah. Multimodal analysis for identification and segmentation of moving-sounding objects. IEEE Transactions on Multimedia, 15(2):378–390, Feb 2013.
  • Carl Doersch, Abhinav Gupta, and Alexei A. Efros. Unsupervised visual representation learning by context prediction. In ICCV, 2015.
  • Basura Fernando, Hakan Bilen, Efstratios Gavves, and Stephen Gould. Self-supervised video representation learning with odd-one-out networks. In 2017 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2017, Honolulu, HI, USA, July 21-26, 2017, pages 5729–5738, 2017.
  • Ishan Misra, C. Lawrence Zitnick, and Martial Hebert. Shuffle and learn: Unsupervised learning using temporal order verification. In Bastian Leibe, Jiri Matas, Nicu Sebe, and Max Welling, editors, Computer Vision – ECCV 2016, pages 527–544, Cham, 2016. Springer International Publishing.
  • Saurabh Gupta, Judy Hoffman, and Jitendra Malik. Cross modal distillation for supervision transfer. In 2016 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2016, Las Vegas, NV, USA, June 27-30, 2016, pages 2827–2836, 2016.
  • Andrew Owens, Jiajun Wu, Josh H. McDermott, William T. Freeman, and Antonio Torralba. Ambient sound provides supervision for visual learning. In Computer Vision - ECCV 2016 14th European Conference, Amsterdam, The Netherlands, October 11-14, 2016, Proceedings, Part I, pages 801–816, 2016.
  • Hang Zhao, Chuang Gan, Andrew Rouditchenko, Carl Vondrick, Josh H. McDermott, and Antonio Torralba. The sound of pixels. In Computer Vision - ECCV 2018 - 15th European Conference, Munich, Germany, September 8-14, 2018, Proceedings, Part I, pages 587–604, 2018.