A Closer Look at Spatiotemporal Convolutions for Action Recognition

2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition (2018)

Citations: 3235 | Views: 571
Abstract
In this paper we discuss several forms of spatiotemporal convolutions for video analysis and study their effects on action recognition. Our motivation stems from the observation that 2D CNNs applied to individual frames of the video have remained solid performers in action recognition. In this work we empirically demonstrate the accuracy advantages of 3D CNNs over 2D CNNs within the framework of residual learning. Furthermore, we show that factorizing the 3D convolutional filters into separate spatial and temporal components yields significant gains in accuracy. Our empirical study leads to the design of a new spatiotemporal convolutional block, "R(2+1)D", which gives rise to CNNs that achieve results comparable or superior to the state of the art on Sports-1M, Kinetics, UCF101 and HMDB51.
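The factorization mentioned in the abstract can be illustrated with a small parameter-count sketch. It assumes a t x d x d 3D filter is replaced by a 1 x d x d spatial convolution followed by a t x 1 x 1 temporal convolution, and that the intermediate channel width is chosen so the factorized block has roughly the same number of weights as the full 3D convolution. The abstract does not state this matching rule, and the function names and layer sizes below are hypothetical, chosen only for illustration.

```python
def conv3d_params(c_in, c_out, t, d):
    """Weight count of a full t x d x d 3D convolution (bias ignored)."""
    return c_in * c_out * t * d * d


def matched_mid_channels(c_in, c_out, t, d):
    """Intermediate width that keeps the factorized block within the
    parameter budget of the full 3D convolution (assumed matching rule)."""
    return (t * d * d * c_in * c_out) // (d * d * c_in + t * c_out)


def conv2plus1d_params(c_in, c_out, t, d, c_mid):
    """Weight count of a spatial 1 x d x d convolution into c_mid channels
    followed by a temporal t x 1 x 1 convolution into c_out channels."""
    spatial = c_in * c_mid * d * d
    temporal = c_mid * c_out * t
    return spatial + temporal


if __name__ == "__main__":
    # Hypothetical layer sizes: 64 -> 64 channels, 3 x 3 x 3 filters.
    c_in, c_out, t, d = 64, 64, 3, 3
    c_mid = matched_mid_channels(c_in, c_out, t, d)
    print("intermediate channels:", c_mid)
    print("3D conv weights:      ", conv3d_params(c_in, c_out, t, d))
    print("(2+1)D block weights: ", conv2plus1d_params(c_in, c_out, t, d, c_mid))
```

Under this assumption the two designs have comparable capacity, so any accuracy difference would come from the factorization itself (for example, the extra nonlinearity that can be placed between the spatial and temporal convolutions) rather than from a change in model size.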
Keywords
spatiotemporal convolutions, action recognition, video analysis, 3D convolutional filters, spatial components, temporal components, R(2+1)D spatiotemporal convolutional block, 2D CNNs, residual learning