VideoWhisper: Toward Discriminative Unsupervised Video Feature Learning With Attention-Based Recurrent Neural Networks.
IEEE Trans. Multimedia(2017)
摘要
We present VideoWhisper, a novel approach for unsupervised video representation learning. Based on the observation that the frame sequence encodes the temporal dynamics of a video (e.g., object movement and event evolution), we treat the frame sequential order as a self-supervision to learn video representations. Unlike other unsupervised video feature learning methods based on frame-level feature reconstruction that is sensitive to visual variance, VideoWhisper is driven by a novel video sequence-to-whisper learning strategy. Specifically, for each video sequence, we use a prelearned visual dictionary to generate a sequence of high-level semantics, dubbed whisper, which can be considered as the language describing the video dynamics. In this way, we model VideoWhisper as an end-to-end sequence-to-sequence learning model using attention-based recurrent neural networks. This model is trained to predict the whisper sequence and hence it is able to learn the temporal structure of videos. We propose two ways to generate video representation from the model. Through extensive experiments on two real-world video datasets, we demonstrate that video representation learned by V ideoWhisper is effective to boost fundamental multimedia applications such as video retrieval and event classification.
更多查看译文
关键词
Semantics,Feature extraction,Image reconstruction,Video sequences,Visualization,Recurrent neural networks,Learning systems
AI 理解论文
溯源树
样例
生成溯源树,研究论文发展脉络