Multi-Scale Spatial-Temporal Integration Convolutional Tube for Human Action Recognition

Haoze Wu
Jiawei Liu
Xierong Zhu
Zheng-Jun Zha

IJCAI, pp. 753-759, 2020.

Keywords:
multi-scale spatial-temporal integration convolutional tube, action recognition, spatial appearance, temporal feature, Computer Vision: Action Recognition

Abstract:

Applying multi-scale representations leads to consistent performance improvements on a wide range of image recognition tasks. However, with the addition of the temporal dimension in the video domain, directly obtaining layer-wise multi-scale spatial-temporal features adds considerable extra computational cost. In this work, we propose a novel and efficient Multi-Scale Spatial-Temporal Integration Convolutional Tube (MSTI) aimed at accurate recognition of actions with lower computational cost.

Introduction
  • With the rapid development of various video platforms on social networks, video is becoming a popular communication medium among internet users.
  • Traditional 2D-CNN based methods [Simonyan and Zisserman, 2014; Donahue et al., 2015] neglected the joint exploration of spatial appearance and temporal motion, which could offer a comprehensive representation of videos and enhance the accuracy of action recognition.
  • The P3D [Qiu et al., 2017] took the lead in separating the 3D convolution into two separate convolutions, i.e., a 2D spatial convolution plus a 1D temporal convolution, and significantly reduced the model size; a minimal sketch of this factorization follows this list.
  • This kind of method still ignored the correlation between spatial appearance and temporal motion.
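To make the factorization concrete, here is a minimal PyTorch sketch of a (2+1)D-style block that replaces a single 3 × 3 × 3 convolution with a 1 × 3 × 3 spatial convolution followed by a 3 × 1 × 1 temporal convolution. The class name, channel widths and the BatchNorm/ReLU placement are illustrative assumptions, not the exact P3D design.

import torch
import torch.nn as nn

class FactorizedSpatioTemporalConv(nn.Module):
    """Replace one 3x3x3 convolution with a 2D spatial + 1D temporal pair."""
    def __init__(self, in_channels, out_channels):
        super().__init__()
        # 1 x 3 x 3: captures spatial appearance within each frame
        self.spatial = nn.Conv3d(in_channels, out_channels,
                                 kernel_size=(1, 3, 3), padding=(0, 1, 1),
                                 bias=False)
        # 3 x 1 x 1: captures temporal motion across frames
        self.temporal = nn.Conv3d(out_channels, out_channels,
                                  kernel_size=(3, 1, 1), padding=(1, 0, 0),
                                  bias=False)
        self.bn = nn.BatchNorm3d(out_channels)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):  # x: (N, C, L, H, W)
        return self.relu(self.bn(self.temporal(self.spatial(x))))

# Example: a batch of 2 clips, 8 RGB frames each, at 112 x 112 resolution
clip = torch.randn(2, 3, 8, 112, 112)
features = FactorizedSpatioTemporalConv(3, 64)(clip)  # -> (2, 64, 8, 112, 112)

Per input-output channel pair, the pair of kernels uses 9 + 3 = 12 weights instead of the 27 of a full 3 × 3 × 3 kernel, which is where the reduction in model size comes from.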
Highlights
  • We propose a Multi-Scale Spatial-Temporal Integration Convolutional Tube (MSTI) aimed at robust and accurate human action recognition.
  • The comparison indicates that our MSTI-Net can learn more effective spatial-temporal features much more efficiently.
  • We address the problem of building highly efficient deep neural networks for human action recognition from the perspectives of generating multi-scale representations and integrating multi-scale spatial-temporal features.
  • We propose a novel Multi-Scale Spatial-Temporal Integration Convolutional Tube in which the multi-scale convolution block generates multi-scale spatial appearance and temporal motion, and the cross-scale attention weighted blocks perform feature recalibration by integrating multi-scale spatial and temporal features, as sketched below.
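As a rough illustration of such cross-scale recalibration, the sketch below assumes an SE-style gating in the spirit of [Hu et al., 2018]: global average pooling squeezes the spatial and temporal feature maps, a small bottleneck MLP produces per-channel weights, and each stream is rescaled before fusion. The class name, the reduction ratio and the additive fusion are assumptions; the paper's actual CAW blocks may be wired differently.

import torch
import torch.nn as nn

class CrossScaleRecalibration(nn.Module):
    """SE-style recalibration over paired spatial and temporal features."""
    def __init__(self, channels, reduction=16):
        super().__init__()
        self.pool = nn.AdaptiveAvgPool3d(1)          # squeeze to (N, C, 1, 1, 1)
        self.gate = nn.Sequential(
            nn.Linear(2 * channels, 2 * channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(2 * channels // reduction, 2 * channels),
            nn.Sigmoid(),
        )

    def forward(self, spatial_feat, temporal_feat):  # both: (N, C, L, H, W)
        n, c = spatial_feat.shape[:2]
        # Integrate global descriptors from both streams
        squeezed = torch.cat([self.pool(spatial_feat), self.pool(temporal_feat)],
                             dim=1).flatten(1)       # (N, 2C)
        weights = self.gate(squeezed).view(n, 2 * c, 1, 1, 1)
        w_s, w_t = weights[:, :c], weights[:, c:]
        # Rescale each stream with weights informed by the other, then fuse
        return spatial_feat * w_s + temporal_feat * w_t

Here spatial_feat and temporal_feat would come from the MSTI-spatial and MSTI-temporal branches, respectively.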
Methods
Results
  • Table 3 shows the performance comparison of the proposed MSTI-Net against ten state-of-the-art methods in terms of Top-1 classification accuracy on Kinetics-400.
  • The authors' MSTI-Net improves on the second best compared method, SlowFast [Feichtenhofer et al., 2019], by 0.5% in terms of Top-1 classification accuracy.
  • The total number of parameters and FLOPs of the MSTI-Net are far lower than those of most methods in the table.
  • Compared to the second best method MRST-Net [Wu et al., 2019a] on
Conclusion
  • The authors address the problem of building highly efficient deep neural networks for human action recognition from the perspectives of generating multi-scale representations and integrating multi-scale spatial-temporal features.
  • The authors propose a novel Multi-Scale Spatial-Temporal Integration Convolutional Tube in which the multi-scale convolution block generates multi-scale spatial appearance and temporal motion, and the cross-scale attention weighted blocks perform feature recalibration by integrating multi-scale spatial and temporal features.
  • Benefiting from the two blocks, the MSTI-Net requires significantly fewer computational resources while achieving state-of-the-art action recognition accuracy.
Summary
  • Methods: MSTI-Net is evaluated against C3D [Tran et al., 2015], LRCN [Donahue et al., 2015], ARTNet [Wang et al., 2018a], I3D-RGB [Carreira and Zisserman, 2017], StNet [He et al., 2018], R(2+1)D-RGB [Tran et al., 2018], S3D [Xie et al., 2018], MRST-Net [Wu et al., 2019a], CFST [Wu et al., 2019b], Nonlocal-I3D [Wang et al., 2018b] and SlowFast [Feichtenhofer et al., 2019] on Kinetics-400 (Section 4.3, Comparison to the State-of-the-Art Methods), and against TSN [Wang et al., 2016], Res3D [Tran et al., 2017], P3D ResNet [Qiu et al., 2017], MiCT-Net [Zhou et al., 2018], ARTNet, I3D-RGB, R(2+1)D-34-RGB and MRST-Net on UCF-101 and HMDB-51.
Tables
  • Table 1: Architecture of the deep MSTI-Net. The details of each convolutional layer are shown in brackets, in the order: number of repetitions, kernel size, strides, and output size. Kernel and stride dimensions are given as time, height, width; output dimensions are given as time, height, width and number of channels.
  • Table 2: Ablation study. Performance of our proposed MSTI tube compared with the P3D-B baseline and the multi-scale convolution on UCF-101 and HMDB-51. All variants use the same network backbone and are pre-trained on Kinetics-400.
  • Table 3: Performance comparison with the state-of-the-art results on Kinetics-400 with only RGB frames as inputs. The input dimensions are given as the number of frames in a clip, the number of channels, and the frame height and width. Here, "All" means using all frames in a video. Our detailed MSTI-Net architecture is shown in Table 1. #Params is the total number of model parameters and FLOPs is the number of floating point operations; both are key indicators of computational cost.
  • Table 4: Action recognition accuracy on UCF-101 and HMDB-51, averaged over three splits. The top part of the table refers to related methods pre-trained on Sports-1M; the lower part refers to related methods pre-trained on Kinetics-400.
Related work
  • With the rapid development of convolutional neural networks in the image domain, video is becoming an increasingly popular domain for researchers to expand into. According to the types of convolutions used for feature learning, existing action recognition works can be broadly divided into two categories: 2D CNN based and 3D CNN based methods.

    MSTI Tube

    The multi-scale spatial-temporal integration convolutional tube (MSTI) adopts a bottleneck structure, as shown in Fig. 1 (panels: (a) MSTI-spatial branch, (b) MSTI-temporal branch, (c) spatial and temporal CAW blocks): two 1 × 1 × 1 convolutional layers at both ends of the path reduce and then restore the channel dimensions, decreasing the overall computational cost. In this section, we first introduce the components of the MSTI tube, i.e., the multi-scale convolution block and the spatial and temporal cross-scale attention weighted (CAW) blocks, and then present our robust and efficient deep network, MSTI-Net, for human action recognition. A minimal sketch of the bottleneck layout follows.
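Below is a minimal sketch of such a bottleneck layout, assuming a ResNet-style channel reduction ratio of 4 and a generic core block standing in for the multi-scale convolution and CAW blocks; the helper name, the ratio and the BatchNorm/ReLU placement are illustrative, not taken from the paper.

import torch.nn as nn

def bottleneck_tube(channels, core_block, reduction=4):
    """Wrap a core spatial-temporal block between two 1x1x1 convolutions."""
    mid = channels // reduction
    return nn.Sequential(
        nn.Conv3d(channels, mid, kernel_size=1, bias=False),   # reduce channels
        nn.BatchNorm3d(mid),
        nn.ReLU(inplace=True),
        core_block(mid),            # e.g. the multi-scale convolution block
        nn.Conv3d(mid, channels, kernel_size=1, bias=False),   # restore channels
        nn.BatchNorm3d(channels),
    )

# Example with a plain 3x3x3 convolution standing in for the core block:
# tube = bottleneck_tube(256, lambda c: nn.Conv3d(c, c, kernel_size=3, padding=1))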

    Multi-Scale Convolution Block

    In the multi-scale convolution block, we first evenly slice the 3D input feature maps X ∈ R^{L×H×W×C} into four groups, denoted by X_i ∈ R^{L×H×W×C'}, where C' = C/4, i ∈ {1, 2, 3, 4}, and L, H, W, C' refer to the length, height, width and the number of channels in each group, respectively.

    In the MSTI-spatial branch, each group X_i has a corresponding 1 × 3 × 3 spatial convolution, except that the first group X_1 is followed by a 1 × 1 × 1 spatial convolution. The spatial convolution applied to group X_i is denoted K^s_i, and its output is denoted S_i. The whole multi-scale spatial convolution architecture presents a stepped structure, as shown in Fig. 2(a), so each output S_i can be written as a function of X_i and the outputs of earlier groups; a hedged sketch of one such stepped multi-scale spatial convolution follows.
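The sketch below shows one plausible reading of this stepped structure, in the style of Res2Net [Gao et al., 2019]: the first group is processed by a 1 × 1 × 1 convolution, the remaining groups by 1 × 3 × 3 convolutions, and each group's slice is summed with the previous group's output before its convolution. The exact connection pattern, group handling and channel sizes are assumptions made for illustration.

import torch
import torch.nn as nn

class MultiScaleSpatialConv(nn.Module):
    """Stepped multi-scale 1x3x3 convolutions over four channel groups."""
    def __init__(self, channels, groups=4):
        super().__init__()
        assert channels % groups == 0, "channels must split evenly into groups"
        self.groups = groups
        gc = channels // groups                              # channels per group
        convs = [nn.Conv3d(gc, gc, kernel_size=1, bias=False)]           # K^s_1
        convs += [nn.Conv3d(gc, gc, kernel_size=(1, 3, 3),
                            padding=(0, 1, 1), bias=False)               # K^s_i, i > 1
                  for _ in range(groups - 1)]
        self.convs = nn.ModuleList(convs)

    def forward(self, x):                                    # x: (N, C, L, H, W)
        slices = torch.chunk(x, self.groups, dim=1)          # X_1, ..., X_4
        outputs, prev = [], None
        for x_i, conv in zip(slices, self.convs):
            inp = x_i if prev is None else x_i + prev        # stepped link (assumed)
            prev = conv(inp)                                 # S_i
            outputs.append(prev)
        return torch.cat(outputs, dim=1)                     # multi-scale features

# Example: 64-channel feature maps from 8-frame clips at 56 x 56
feats = torch.randn(2, 64, 8, 56, 56)
out = MultiScaleSpatialConv(64)(feats)                       # -> (2, 64, 8, 56, 56)

A temporal counterpart could mirror this layout with 3 × 1 × 1 kernels in place of the 1 × 3 × 3 spatial kernels; that is an assumption for illustration, not a detail taken from the paper.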
Funding
  • This work was supported by the National Key R&D Program of China under Grant 2017YFB1300201, the National Natural Science Foundation of China (NSFC) under Grants U19B2038, 61620106009 and 61725203, as well as the Fundamental Research Funds for the Central Universities under Grant WK2100100030.
References
  • [Carreira and Zisserman, 2017] Joao Carreira and Andrew Zisserman. Quo vadis, action recognition? A new model and the Kinetics dataset. In CVPR, pages 6299–6308, 2017.
  • [Donahue et al., 2015] Jeffrey Donahue, Lisa Anne Hendricks, Sergio Guadarrama, Marcus Rohrbach, Subhashini Venugopalan, Kate Saenko, and Trevor Darrell. Long-term recurrent convolutional networks for visual recognition and description. In CVPR, pages 2625–2634, 2015.
  • [Feichtenhofer et al., 2016] Christoph Feichtenhofer, Axel Pinz, and Richard Wildes. Spatiotemporal residual networks for video action recognition. In NIPS, pages 3468–3476, 2016.
  • [Feichtenhofer et al., 2019] Christoph Feichtenhofer, Haoqi Fan, Jitendra Malik, and Kaiming He. SlowFast networks for video recognition. In ICCV, pages 6202–6211, 2019.
  • [Gao et al., 2019] Shang-Hua Gao, Ming-Ming Cheng, Kai Zhao, Xin-Yu Zhang, Ming-Hsuan Yang, and Philip Torr. Res2Net: A new multi-scale backbone architecture. arXiv preprint arXiv:1904.01169, 2019.
  • [He et al., 2016] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In CVPR, pages 770–778, 2016.
  • [He et al., 2018] Dongliang He, Zhichao Zhou, Chuang Gan, Fu Li, Xiao Liu, Yandong Li, Liming Wang, and Shilei Wen. StNet: Local and global spatial-temporal modeling for action recognition. arXiv preprint arXiv:1811.01549, 2018.
  • [Hu et al., 2018] Jie Hu, Li Shen, and Gang Sun. Squeeze-and-excitation networks. In CVPR, pages 7132–7141, 2018.
  • [Karpathy et al., 2014] Andrej Karpathy, George Toderici, Sanketh Shetty, Thomas Leung, Rahul Sukthankar, and Li Fei-Fei. Large-scale video classification with convolutional neural networks. In CVPR, pages 1725–1732, 2014.
  • [Kay et al., 2017] Will Kay, Joao Carreira, Karen Simonyan, Brian Zhang, Chloe Hillier, Sudheendra Vijayanarasimhan, Fabio Viola, Tim Green, Trevor Back, Paul Natsev, et al. The Kinetics human action video dataset. arXiv preprint arXiv:1705.06950, 2017.
  • [Kuehne et al., 2013] Hilde Kuehne, Hueihan Jhuang, Rainer Stiefelhagen, and Thomas Serre. HMDB51: A large video database for human motion recognition. In HPCSE, pages 571–582, 2013.
  • [Liu et al., 2016] Jiawei Liu, Zheng-Jun Zha, Qi Tian, Dong Liu, Ting Yao, Qiang Ling, and Tao Mei. Multi-scale triplet CNN for person re-identification. In ACM Multimedia, pages 192–196, 2016.
  • [Nair and Hinton, 2010] Vinod Nair and Geoffrey E Hinton. Rectified linear units improve restricted Boltzmann machines. In ICML, pages 807–814, 2010.
  • [Qiu et al., 2017] Zhaofan Qiu, Ting Yao, and Tao Mei. Learning spatio-temporal representation with pseudo-3D residual networks. In ICCV, pages 5533–5541, 2017.
  • [Simonyan and Zisserman, 2014] Karen Simonyan and Andrew Zisserman. Two-stream convolutional networks for action recognition in videos. In NIPS, pages 568–576, 2014.
  • [Soomro et al., 2012] Khurram Soomro, Amir Roshan Zamir, and Mubarak Shah. UCF101: A dataset of 101 human actions classes from videos in the wild. arXiv preprint arXiv:1212.0402, 2012.
  • [Szegedy et al., 2017] Christian Szegedy, Sergey Ioffe, Vincent Vanhoucke, and Alexander A Alemi. Inception-v4, Inception-ResNet and the impact of residual connections on learning. In AAAI, 2017.
  • [Tran et al., 2015] Du Tran, Lubomir Bourdev, Rob Fergus, Lorenzo Torresani, and Manohar Paluri. Learning spatiotemporal features with 3D convolutional networks. In ICCV, pages 4489–4497, 2015.
  • [Tran et al., 2017] Du Tran, Jamie Ray, Zheng Shou, Shih-Fu Chang, and Manohar Paluri. ConvNet architecture search for spatiotemporal feature learning. arXiv preprint arXiv:1708.05038, 2017.
  • [Tran et al., 2018] Du Tran, Heng Wang, Lorenzo Torresani, Jamie Ray, Yann LeCun, and Manohar Paluri. A closer look at spatiotemporal convolutions for action recognition. In CVPR, pages 6450–6459, 2018.
  • [Wang et al., 2016] Limin Wang, Yuanjun Xiong, Zhe Wang, Yu Qiao, Dahua Lin, Xiaoou Tang, and Luc Van Gool. Temporal segment networks: Towards good practices for deep action recognition. In ECCV, pages 20–36, 2016.
  • [Wang et al., 2018a] Limin Wang, Wei Li, Wen Li, and Luc Van Gool. Appearance-and-relation networks for video classification. In CVPR, pages 1430–1439, 2018.
  • [Wang et al., 2018b] Xiaolong Wang, Ross Girshick, Abhinav Gupta, and Kaiming He. Non-local neural networks. In CVPR, pages 7794–7803, 2018.
  • [Wu et al., 2019a] Haoze Wu, Jiawei Liu, Zheng-Jun Zha, Zhenzhong Chen, and Xiaoyan Sun. Mutually reinforced spatio-temporal convolutional tube for human action recognition. In IJCAI, pages 968–974, 2019.
  • [Wu et al., 2019b] Haoze Wu, Zheng-Jun Zha, Xin Wen, Zhenzhong Chen, Dong Liu, and Xuejin Chen. Cross-fiber spatial-temporal co-enhanced networks for video action recognition. In ACM Multimedia, pages 620–628, 2019.
  • [Xie et al., 2018] Saining Xie, Chen Sun, Jonathan Huang, Zhuowen Tu, and Kevin Murphy. Rethinking spatiotemporal feature learning: Speed-accuracy trade-offs in video classification. In ECCV, pages 305–321, 2018.
  • [Zhou et al., 2018] Yizhou Zhou, Xiaoyan Sun, Zheng-Jun Zha, and Wenjun Zeng. MiCT: Mixed 3D/2D convolutional tube for human action recognition. In CVPR, pages 449–458, 2018.