Disentangling Motion, Foreground and Background Features in Videos.
arXiv: Computer Vision and Pattern Recognition (2017)
Abstract
This paper introduces an unsupervised framework to extract semantically rich features for video representation. Inspired by how the human visual system groups objects based on motion cues, we propose a deep convolutional neural network that disentangles motion, foreground and background information. The proposed architecture consists of a 3D convolutional feature encoder for blocks of 16 frames, which is trained for reconstruction tasks over the first and last frames of the sequence. The model is trained with a fraction of videos from the UCF-101 dataset, taking as ground truth the bounding boxes around the activity regions. Qualitative results indicate that the network can successfully update the foreground appearance based on pure-motion features. The benefits of these learned features are shown in a discriminative classification task when compared with a random initialization of the network weights, providing an accuracy gain above 10%.
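The abstract describes a 3D convolutional feature encoder operating on blocks of 16 frames. The sketch below is a hypothetical illustration of such an encoder in PyTorch, not the authors' architecture: the layer count, channel widths, and strides are assumptions chosen only to show how a 16-frame clip is reduced to a spatio-temporal feature volume.

```python
# Hypothetical sketch (not the paper's code): a 3D convolutional encoder
# over clips of 16 frames, in the spirit of the architecture the abstract
# describes. All hyperparameters here are illustrative assumptions.
import torch
import torch.nn as nn


class VideoEncoder3D(nn.Module):
    """Encodes a clip of 16 RGB frames into a spatio-temporal feature volume."""

    def __init__(self, in_channels=3, base_channels=32):
        super().__init__()
        self.encoder = nn.Sequential(
            # Downsample spatially first, keeping all 16 frames.
            nn.Conv3d(in_channels, base_channels,
                      kernel_size=3, stride=(1, 2, 2), padding=1),
            nn.ReLU(inplace=True),
            # Then downsample jointly in time and space.
            nn.Conv3d(base_channels, base_channels * 2,
                      kernel_size=3, stride=2, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv3d(base_channels * 2, base_channels * 4,
                      kernel_size=3, stride=2, padding=1),
            nn.ReLU(inplace=True),
        )

    def forward(self, clip):
        # clip: (batch, channels, 16 frames, height, width)
        return self.encoder(clip)


encoder = VideoEncoder3D()
clip = torch.randn(2, 3, 16, 112, 112)  # two clips of 16 frames at 112x112
features = encoder(clip)
print(tuple(features.shape))  # (2, 128, 4, 14, 14)
```

In a setup like the one the abstract outlines, such a feature volume would then feed reconstruction heads for the first and last frames of the sequence; those decoder heads are omitted here.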