Deep Local Video Feature for Action Recognition

2017 IEEE Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), 2017

Cited by 146 | Viewed 182
Abstract
We investigate the problem of representing an entire video using CNN features for human action recognition. Currently, limited by GPU memory, we cannot feed a whole video into CNNs/RNNs for end-to-end learning. A common practice is to use sampled frames as inputs and video-level labels as supervision. One major problem with this popular approach is that the local samples may not contain the information indicated by the global labels. To deal with this problem, we propose to treat the deep networks trained on local inputs as local feature extractors. After extracting local features, we aggregate them into global features and train another mapping function on the same training data to map the global features to the global labels. We study a set of problems regarding this new type of local feature, such as how to aggregate the local features into global features. Experimental results on the HMDB51 and UCF101 datasets show that, for these new local features, a simple maximum pooling over the sparsely sampled features leads to significant performance improvements.
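The pipeline described above (sparsely sample frames, extract per-frame features with a CNN trained on local inputs, max-pool them into one video-level descriptor, then train a second-stage classifier) can be illustrated with a minimal sketch. This is an assumption-laden illustration, not the authors' code: it uses NumPy, a hypothetical `cnn_forward` feature extractor, and an illustrative sample count of 25.

```python
import numpy as np

def extract_local_features(video_frames, cnn_forward, num_samples=25):
    """Sparsely sample frames and run each through a trained CNN.

    `cnn_forward` is a hypothetical callable mapping one frame to a
    D-dimensional feature vector; 25 samples is an illustrative choice.
    """
    idx = np.linspace(0, len(video_frames) - 1, num_samples).astype(int)
    return np.stack([cnn_forward(video_frames[i]) for i in idx])  # (num_samples, D)

def aggregate_max_pool(local_features):
    """Element-wise maximum over the sampled features -> one global feature."""
    return local_features.max(axis=0)  # (D,)

if __name__ == "__main__":
    # Dummy stand-ins so the sketch runs: random "frames" and a fake CNN
    # that returns a 512-d feature per frame.
    rng = np.random.default_rng(0)
    frames = [rng.standard_normal((224, 224, 3)) for _ in range(300)]
    fake_cnn = lambda frame: rng.standard_normal(512)

    local = extract_local_features(frames, fake_cnn, num_samples=25)
    global_feature = aggregate_max_pool(local)  # shape (512,)
    # A second-stage classifier (e.g., a linear SVM) would then be trained
    # on such global features using the video-level labels.
    print(global_feature.shape)
```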
Keywords
performance improvement, sparsely sampled local features, HMDB51 dataset, UCF101 dataset, noisy local labels, second classification stage, video-level labels, global features, local feature extractors, local inputs, deep network training, GPU memory limitations, RNN, end-to-end learning, human action recognition, action recognition, deep local video feature, CNN features