Video Modeling with Correlation Networks

CVPR 2020, pp. 352–361.

Keywords:
feature map, learnable correlation operator, motion information, stream network, convolutional network

Abstract:

Motion is a salient cue to recognize actions in video. Modern action recognition models leverage motion information either explicitly, by using optical flow as input, or implicitly, by means of 3D convolutional filters that simultaneously capture appearance and motion information. This paper proposes an alternative approach based on a learnable correlation operator that can be used to establish frame-to-frame matches over convolutional feature maps in the different layers of the network. The resulting correlation network compares favorably with widely-used 3D CNNs for video modeling, achieves competitive results with the prominent two-stream network while being much faster to train, and outperforms the state of the art on four popular benchmarks for action recognition: Kinetics, Something-Something, Diving48 and Sports1M.

Introduction
  • After the breakthrough of AlexNet [30] on ImageNet [7], convolutional neural networks (CNNs) have become the dominant model for still-image classification [33, 47, 52, 21].
  • For video analysis, CNNs have been extended to capture both the appearance information contained in individual frames and the motion information extracted from the temporal dimension of the image sequence.
  • This is usually achieved by one of two possible mechanisms.
Highlights
  • After the breakthrough of AlexNet [30] on ImageNet [7], convolutional neural networks (CNNs) have become the dominant model for still-image classification [33, 47, 52, 21]
  • One strategy involves the use of a two-stream network [46, 57, 16, 58, 42, 5] where one stream operates on RGB frames to model appearance information and the other stream extracts motion features from optical flow provided as input
  • Unlike previous approaches based on optical flow or 3D convolution, we propose a learnable correlation operator which establishes frame-to-frame matches over convolutional feature maps in the different layers of the network
  • Unlike standard 3D convolution, the correlation operator makes the computation of motion information explicit
  • We design the correlation network based on this novel operator and demonstrate its superior performance on various video datasets for action recognition (a minimal sketch of the operator follows this list)
  • CorrNet-101 is still 1.6% and 5.9% better on Something-Something and Diving48
  • Potential future work includes the application of the learnable correlation operator to other tasks, such as action localization, optical flow, and geometry matching
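To make the highlighted operator concrete, here is a minimal PyTorch sketch. It is an illustration under assumptions rather than the authors' implementation: we assume channel-wise products between each frame and a K × K neighborhood of the next frame, weighted by a learnable per-channel, per-displacement filter and summed over channel groups; the class name LearnableCorrelation and all hyper-parameter defaults are hypothetical.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class LearnableCorrelation(nn.Module):
    """Minimal sketch of a learnable correlation operator: for every
    position in frame t, compute weighted channel-wise products against
    a K x K neighborhood in frame t+1, then sum within channel groups."""

    def __init__(self, channels, k=7, groups=8):
        super().__init__()
        assert channels % groups == 0
        self.k, self.groups = k, groups
        # one learnable weight per channel and per spatial displacement
        self.filter = nn.Parameter(torch.ones(channels, k * k))

    def forward(self, x):
        n, c, t, h, w = x.shape                              # (N, C, T, H, W)
        pad = self.k // 2
        out = []
        for i in range(t):
            a = x[:, :, i]                                   # frame t
            b = x[:, :, min(i + 1, t - 1)]                   # frame t+1
            # K*K shifted views of frame t+1 around every position
            patches = F.unfold(b, self.k, padding=pad)       # (N, C*K*K, H*W)
            patches = patches.view(n, c, self.k * self.k, h, w)
            # weighted channel-wise matches between the two frames
            corr = a.unsqueeze(2) * patches * self.filter.view(1, c, -1, 1, 1)
            # sum within channel groups -> (N, groups, K*K, H, W)
            corr = corr.view(n, self.groups, c // self.groups,
                             self.k * self.k, h, w).sum(dim=2)
            out.append(corr.flatten(1, 2))                   # (N, groups*K*K, H, W)
        return torch.stack(out, dim=2)                       # (N, groups*K*K, T, H, W)
```

For example, with channels=64, k=7 and groups=8, a (N, 64, T, H, W) feature map produces a (N, 392, T, H, W) correlation volume that downstream convolutions can fuse with appearance features.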
Methods
  • The experiments compare CorrNet against state-of-the-art baselines drawn from the paper's result tables, including STC-ResNext-101 [8], R(2+1)D [55], MARS+RGB [5], ip-CSN-152 [54], DynamoNet [9], SlowFast-101 [15], SlowFast-101+NL [15], I3D [4], NL I3D-101 [59], LGD-3D-101 [42], and S3D-G [63].
Results
  • The authors' correlation network outperforms the state-of-the-art on four different video datasets without using optical flow.
  • Compared with CorrNet-26 in Table 3, CorrNet-101 is 4.1%, 4.3% and 3.1% better on Kinetics, Something-Something and Diving48, respectively.
  • Compared to SlowFast-101, CorrNet-101 achieves slightly higher accuracy (79.2% vs 78.9%), and it is only 0.6% lower in accuracy when SlowFast-101 is combined with NL.
  • Compared with methods that use pre-training, CorrNet-101 is 1.6% better than LGD-3D [42], i.e., 81.0% vs 79.4%.
  • CorrNet-101 is still 1.6% and 5.9% better on Something-Something and Diving48.
Conclusion
  • Unlike previous approaches based on optical flow or 3D convolution, the authors propose a learnable correlation operator which establishes frame-to-frame matches over convolutional feature maps in the different layers of the network.
  • The authors design the correlation network based on this novel operator and demonstrate its superior performance on various video datasets for action recognition.
  • Potential future work includes the application of the learnable correlation operator to other tasks, such as action localization, optical flow, and geometry matching.
Tables
  • Table 1: A comparison of the correlation operator with 3D convolution. When the filter sizes are similar (i.e., K × K ≈ Kt × Ky × Kx), 3D convolution has about Cout/L times more parameters than the correlation operator and about Cout times higher FLOPs (a worked check of these ratios follows this list)
  • Table 2: The R(2+1)D backbone used for building the correlation network
  • Table 3: Correlation networks vs baselines. Our CorrNet significantly outperforms the two baseline architectures on three datasets, at a very small increase in FLOPs compared to R(2+1)D. Using a longer clip length L leads to better accuracy on all three datasets
  • Table 4: Action recognition accuracy (%) for different configurations of CorrNet
  • Table 5: Action recognition accuracy (%) of CorrNet vs the two-stream network
  • Table 6: Comparison with the state-of-the-art on Kinetics-400
  • Table 7: Comparison with the state-of-the-art on Something-Something v1 and Diving48
  • Table 8: Comparison with the state-of-the-art on Sports1M
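The Table 1 ratios admit a short worked check. The parameter count L · K² · C assumed below for the correlation operator (one K × K filter per channel per frame) is our reading, chosen because it reproduces the ratios stated in the caption:

```latex
% 3D convolution vs. the correlation operator, assuming the correlation
% operator has L * K^2 * C parameters and that C_in = C.
\[
\frac{\mathrm{params}(\text{3D conv})}{\mathrm{params}(\text{corr})}
  = \frac{C_{in}\,C_{out}\,K_t K_y K_x}{L\,K^2\,C}
  \approx \frac{C_{out}}{L},
\qquad
\frac{\mathrm{FLOPs}(\text{3D conv})}{\mathrm{FLOPs}(\text{corr})}
  = \frac{THW \cdot C_{in}\,C_{out}\,K_t K_y K_x}{THW \cdot K^2\,C}
  \approx C_{out}.
\]
% Example: with C = C_in = C_out = 256 and clip length L = 32, 3D
% convolution has 256/32 = 8x the parameters and ~256x the FLOPs.
```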
Related work
  • Architectures for video classification. Among the popular video models, there are two major categories: two-stream networks [46, 57, 16, 58, 42, 5] and 3D CNNs [1, 25, 53, 50, 55, 41, 63, 9, 14]. Since the introduction of two-stream networks [46], further improvements have been achieved by adding connections between the two streams [16] or by inflating a 2D model to 3D [4]. 3D CNNs [1, 25, 53] learn appearance and motion information simultaneously by convolving 3D filters in space and time. Successful image architectures [47, 52, 21] have been extended to video using 3D convolution [4, 53, 63]. Recent research [50, 55, 41, 42] shows that decomposing 3D convolution into 2D spatial convolution and 1D temporal convolution leads to better performance (a minimal code sketch of this decomposition follows these paragraphs). Our correlation network goes beyond two-stream networks and 3D convolution: we propose a new operator that can better learn the temporal dynamics of video sequences.
  • Motion information for action recognition. Before the popularity of deep learning, various video features [32, 45, 29, 10, 56] were hand-designed to encode motion information in video. Besides two-stream networks and 3D CNNs, ActionFlowNet [39] proposes to jointly estimate optical flow and recognize actions in one network. Fan et al. [12] and Piergiovanni and Ryoo [40] also introduced networks to learn optical flow end-to-end for action recognition.
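The 2D + 1D decomposition mentioned above is easy to state in code. Below is a minimal sketch; the class name Conv2Plus1D and the mid-channel simplification are ours, whereas R(2+1)D [55] actually picks the mid-channel width so that the parameter count matches the original 3x3x3 convolution.

```python
import torch
import torch.nn as nn

class Conv2Plus1D(nn.Module):
    """Factorize a 3x3x3 convolution into a 1x3x3 spatial convolution
    followed by a 3x1x1 temporal convolution, with a nonlinearity in
    between (the added nonlinearity is part of why the decomposition helps)."""

    def __init__(self, c_in, c_out, c_mid=None):
        super().__init__()
        c_mid = c_mid or c_out  # simplification; see note above
        self.spatial = nn.Conv3d(c_in, c_mid, (1, 3, 3), padding=(0, 1, 1))
        self.relu = nn.ReLU(inplace=True)
        self.temporal = nn.Conv3d(c_mid, c_out, (3, 1, 1), padding=(1, 0, 0))

    def forward(self, x):  # x: (N, C, T, H, W)
        return self.temporal(self.relu(self.spatial(x)))

# Shape check: the (T, H, W) extent is preserved.
y = Conv2Plus1D(16, 32)(torch.randn(2, 16, 8, 32, 32))
assert y.shape == (2, 32, 8, 32, 32)
```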
Contributions
  • Proposes an alternative approach based on a learnable correlation operator that can be used to establish frame-to-frame matches over convolutional feature maps in the different layers of the network
  • Demonstrates that correlation networks produce strong results on a variety of video datasets, and outperform the state of the art on four popular benchmarks for action recognition: Kinetics, Something-Something, Diving48 and Sports1M
  • Proposes a new scheme based on a novel correlation operator inspired by the correlation layer in FlowNet
  • Proposes a learnable correlation operator to establish frame-to-frame matches over convolutional feature maps to capture different notions of similarity in different layers of the network
  • Demonstrates that our correlation network compares favorably with widely-used 3D CNNs for video modeling, and achieves competitive results over the prominent two-stream network while being much faster to train
  • Our correlation network outperforms the state-of-the-art on four different video datasets without using optical flow
References
  • Moez Baccouche, Franck Mamalet, Christian Wolf, Christophe Garcia, and Atilla Baskurt. Sequential deep learning for human action recognition. In International Workshop on Human Behavior Understanding, pages 29–39. Springer, 2011.
  • Gedas Bertasius, Christoph Feichtenhofer, Du Tran, Jianbo Shi, and Lorenzo Torresani. Learning discriminative motion features through detection. arXiv preprint arXiv:1812.04172, 2018.
  • Caffe2-Team. Caffe2: A new lightweight, modular, and scalable deep learning framework. https://caffe2.ai/
  • J. Carreira and A. Zisserman. Quo vadis, action recognition? A new model and the Kinetics dataset. In CVPR, 2017.
  • Nieves Crasto, Philippe Weinzaepfel, Karteek Alahari, and Cordelia Schmid. MARS: Motion-augmented RGB stream for action recognition. In CVPR, pages 7882–7891, 2019.
  • N. Dalal, B. Triggs, and C. Schmid. Human detection using oriented histograms of flow and appearance. In ECCV, 2006.
  • Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. ImageNet: A large-scale hierarchical image database. In CVPR, pages 248–255. IEEE, 2009.
  • Ali Diba, Mohsen Fayyaz, Vivek Sharma, M. Mahdi Arzani, Rahman Yousefzadeh, Juergen Gall, and Luc Van Gool. Spatio-temporal channel correlation networks for action classification. In ECCV, 2018.
  • Ali Diba, Vivek Sharma, Luc Van Gool, and Rainer Stiefelhagen. DynamoNet: Dynamic action and motion network. In ICCV, 2019.
  • P. Dollar, V. Rabaud, G. Cottrell, and S. Belongie. Behavior recognition via sparse spatio-temporal features. In Proc. ICCV VS-PETS, 2005.
  • Alexey Dosovitskiy, Philipp Fischer, Eddy Ilg, Philip Hausser, Caner Hazirbas, Vladimir Golkov, Patrick van der Smagt, Daniel Cremers, and Thomas Brox. FlowNet: Learning optical flow with convolutional networks. In ICCV, 2015.
  • Lijie Fan, Wenbing Huang, Chuang Gan, Stefano Ermon, Boqing Gong, and Junzhou Huang. End-to-end learning of motion representation for video understanding. In CVPR, pages 6016–6025, 2018.
  • Gunnar Farneback. Two-frame motion estimation based on polynomial expansion. In Scandinavian Conference on Image Analysis, pages 363–370. Springer, 2003.
  • Christoph Feichtenhofer. X3D: Expanding architectures for efficient video recognition. arXiv preprint arXiv:2004.04730, 2020.
  • Christoph Feichtenhofer, Haoqi Fan, Jitendra Malik, and Kaiming He. SlowFast networks for video recognition. In ICCV, pages 6202–6211, 2019.
  • Christoph Feichtenhofer, Axel Pinz, and Andrew Zisserman. Convolutional two-stream network fusion for video action recognition. In CVPR, 2016.
  • Christoph Feichtenhofer, Axel Pinz, and Andrew Zisserman. Detect to track and track to detect. In ICCV, pages 3038–3046, 2017.
  • Priya Goyal, Piotr Dollar, Ross Girshick, Pieter Noordhuis, Lukasz Wesolowski, Aapo Kyrola, Andrew Tulloch, Yangqing Jia, and Kaiming He. Accurate, large minibatch SGD: Training ImageNet in 1 hour. arXiv preprint arXiv:1706.02677, 2017.
  • Raghav Goyal, Samira Ebrahimi Kahou, Vincent Michalski, Joanna Materzynska, Susanne Westphal, Heuna Kim, Valentin Haenel, Ingo Fruend, Peter Yianilos, Moritz Mueller-Freitag, et al. The "something something" video database for learning and evaluating visual common sense. In Proc. ICCV, 2017.
  • Xiaoyang Guo, Kai Yang, Wukui Yang, Xiaogang Wang, and Hongsheng Li. Group-wise correlation stereo network. In CVPR, pages 3273–3282, 2019.
  • Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In CVPR, 2016.
  • Omar Hommos, Silvia L. Pintea, Pascal S. M. Mettes, and Jan C. van Gemert. Using phase instead of optical flow for action recognition. In ECCV Workshops, 2018.
  • Jie Hu, Li Shen, and Gang Sun. Squeeze-and-excitation networks. In CVPR, pages 7132–7141, 2018.
  • Eddy Ilg, Nikolaus Mayer, Tonmoy Saikia, Margret Keuper, Alexey Dosovitskiy, and Thomas Brox. FlowNet 2.0: Evolution of optical flow estimation with deep networks. In CVPR, 2017.
  • Shuiwang Ji, Wei Xu, Ming Yang, and Kai Yu. 3D convolutional neural networks for human action recognition. IEEE TPAMI, 35(1):221–231, 2013.
  • Gagan Kanojia, Sudhakar Kumawat, and Shanmuganathan Raman. Attentive spatio-temporal representation learning for diving classification. In CVPR Workshops, 2019.
  • Andrej Karpathy, George Toderici, Sanketh Shetty, Thomas Leung, Rahul Sukthankar, and Li Fei-Fei. Large-scale video classification with convolutional neural networks. In CVPR, 2014.
  • Will Kay, Joao Carreira, Karen Simonyan, Brian Zhang, Chloe Hillier, Sudheendra Vijayanarasimhan, Fabio Viola, Tim Green, Trevor Back, Paul Natsev, et al. The Kinetics human action video dataset. arXiv preprint arXiv:1705.06950, 2017.
  • Alexander Klaser, Marcin Marszałek, and Cordelia Schmid. A spatio-temporal descriptor based on 3D-gradients. In BMVC, 2008.
  • A. Krizhevsky, I. Sutskever, and G. Hinton. ImageNet classification with deep convolutional neural networks. In NIPS, 2012.
  • H. Kuehne, H. Jhuang, E. Garrote, T. Poggio, and T. Serre. HMDB51: A large video database for human motion recognition. In ICCV, 2011.
  • I. Laptev and Tony Lindeberg. Space-time interest points. In ICCV, 2003.
  • Yann LeCun, Leon Bottou, Yoshua Bengio, and Patrick Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, 1998.
  • Myunggi Lee, Seungeui Lee, Sungjoon Son, Gyutae Park, and Nojun Kwak. Motion feature network: Fixed motion filter for action recognition. In ECCV, pages 387–403, 2018.
  • Yingwei Li, Yi Li, and Nuno Vasconcelos. RESOUND: Towards action recognition without representation bias. In ECCV, pages 513–528, 2018.
  • Ilya Loshchilov and Frank Hutter. SGDR: Stochastic gradient descent with warm restarts. In ICLR, 2017.
  • David G. Lowe. Distinctive image features from scale-invariant keypoints. International Journal of Computer Vision, 60(2):91–110, 2004.
  • Chenxu Luo and Alan L. Yuille. Grouped spatial-temporal aggregation for efficient action recognition. In ICCV, pages 5512–5521, 2019.
  • Joe Yue-Hei Ng, Jonghyun Choi, Jan Neumann, and Larry S. Davis. ActionFlowNet: Learning motion representation for action recognition. In WACV, pages 1616–1624. IEEE, 2018.
  • AJ Piergiovanni and Michael S. Ryoo. Representation flow for action recognition. In CVPR, 2019.
  • Zhaofan Qiu, Ting Yao, and Tao Mei. Learning spatio-temporal representation with pseudo-3D residual networks. In ICCV, 2017.
  • Zhaofan Qiu, Ting Yao, Chong-Wah Ngo, Xinmei Tian, and Tao Mei. Learning spatio-temporal representation with local and global diffusion. In CVPR, pages 12056–12065, 2019.
  • Anurag Ranjan and Michael J. Black. Optical flow estimation using a spatial pyramid network. In CVPR, 2017.
  • Ignacio Rocco, Relja Arandjelovic, and Josef Sivic. Convolutional neural network architecture for geometric matching. In Proc. CVPR, 2017.
  • P. Scovanner, S. Ali, and M. Shah. A 3-dimensional SIFT descriptor and its application to action recognition. In ACM MM, 2007.
  • Karen Simonyan and Andrew Zisserman. Two-stream convolutional networks for action recognition in videos. In NIPS, 2014.
  • Karen Simonyan and Andrew Zisserman. Very deep convolutional networks for large-scale image recognition. In ICLR, 2015.
  • Khurram Soomro, Amir Roshan Zamir, and Mubarak Shah. UCF101: A dataset of 101 human action classes from videos in the wild. CRCV-TR-12-01, 2012.
  • Deqing Sun, Xiaodong Yang, Ming-Yu Liu, and Jan Kautz. PWC-Net: CNNs for optical flow using pyramid, warping, and cost volume. In CVPR, pages 8934–8943, 2018.
  • Lin Sun, Kui Jia, Dit-Yan Yeung, and Bertram E. Shi. Human action recognition using factorized spatio-temporal convolutional networks. In ICCV, 2015.
  • Shuyang Sun, Zhanghui Kuang, Lu Sheng, Wanli Ouyang, and Wei Zhang. Optical flow guided feature: A fast and robust motion representation for video action recognition. In CVPR, 2018.
  • Christian Szegedy, Wei Liu, Yangqing Jia, Pierre Sermanet, Scott Reed, Dragomir Anguelov, Dumitru Erhan, Vincent Vanhoucke, and Andrew Rabinovich. Going deeper with convolutions. In CVPR, 2015.
  • Du Tran, Lubomir Bourdev, Rob Fergus, Lorenzo Torresani, and Manohar Paluri. Learning spatiotemporal features with 3D convolutional networks. In ICCV, 2015.
  • Du Tran, Heng Wang, Lorenzo Torresani, and Matt Feiszli. Video classification with channel-separated convolutional networks. In ICCV, 2019.
  • Du Tran, Heng Wang, Lorenzo Torresani, Jamie Ray, Yann LeCun, and Manohar Paluri. A closer look at spatiotemporal convolutions for action recognition. In CVPR, pages 6450–6459, 2018.
  • Heng Wang and Cordelia Schmid. Action recognition with improved trajectories. In ICCV, 2013.
  • Limin Wang, Yuanjun Xiong, Zhe Wang, and Yu Qiao. Towards good practices for very deep two-stream convnets. arXiv preprint arXiv:1507.02159, 2015.
  • Limin Wang, Yuanjun Xiong, Zhe Wang, Yu Qiao, Dahua Lin, Xiaoou Tang, and Luc Van Gool. Temporal segment networks: Towards good practices for deep action recognition. In ECCV, 2016.
  • Xiaolong Wang, Ross Girshick, Abhinav Gupta, and Kaiming He. Non-local neural networks. In CVPR, 2018.
  • Xiaolong Wang and Abhinav Gupta. Videos as space-time region graphs. In ECCV, pages 399–417, 2018.
  • Philippe Weinzaepfel, Jerome Revaud, Zaid Harchaoui, and Cordelia Schmid. DeepFlow: Large displacement optical flow with deep matching. In ICCV, pages 1385–1392, 2013.
  • Saining Xie, Ross Girshick, Piotr Dollar, Zhuowen Tu, and Kaiming He. Aggregated residual transformations for deep neural networks. In CVPR, pages 5987–5995. IEEE, 2017.
  • Saining Xie, Chen Sun, Jonathan Huang, Zhuowen Tu, and Kevin Murphy. Rethinking spatiotemporal feature learning: Speed-accuracy trade-offs in video classification. In ECCV, pages 305–321, 2018.
  • Fisher Yu and Vladlen Koltun. Multi-scale context aggregation by dilated convolutions. arXiv preprint arXiv:1511.07122, 2015.
  • Joe Yue-Hei Ng, Matthew Hausknecht, Sudheendra Vijayanarasimhan, Oriol Vinyals, Rajat Monga, and George Toderici. Beyond short snippets: Deep networks for video classification. In CVPR, pages 4694–4702, 2015.
  • Christopher Zach, Thomas Pock, and Horst Bischof. A duality based approach for realtime TV-L1 optical flow. In Pattern Recognition, pages 214–223, 2007.
  • Yue Zhao, Yuanjun Xiong, and Dahua Lin. Recognize actions by disentangling components of dynamics. In CVPR, pages 6566–6575, 2018.
  • Bolei Zhou, Alex Andonian, Aude Oliva, and Antonio Torralba. Temporal relational reasoning in videos. In ECCV, pages 803–818, 2018.