Structured Stochastic Recurrent Network for Linguistic Video Prediction

Proceedings of the 27th ACM International Conference on Multimedia (2019)

Abstract
Intelligent machines are expected to be capable of predicting impending events. Inspired by video frame prediction and video captioning, we introduce a new task, Linguistic Video Prediction (LVP), which aims to predict forthcoming events from past video content and generate corresponding linguistic descriptions. Unlike traditional video captioning, which describes one specific event that has already happened, LVP is an open task involving one-to-many mappings between past and future. It explores different visual clues and associates them with potential events to generate corresponding descriptions. To address this task, we propose an end-to-end probabilistic approach named structured stochastic recurrent network (SRN) to characterize the one-to-many connections between past visual clues and possible future events. Specifically, we first propose hierarchical-structured latent variables to represent the choice of event theme. Second, we introduce a stochastic attention module to capture the variations of the focused visual clues. Given a video, our model is able to generate multiple linguistic predictions by focusing on different event themes and visual clues. Experiments on the ActivityNet dataset show that the proposed model not only yields more informative predictions, as measured by BLEU, METEOR, ROUGE-L, CIDEr and SPICE scores, but also generates significantly more diverse predictions with higher recall rates in hitting the ground truth.
Keywords
hierarchical-structured latent variables, linguistic video prediction, structured stochastic recurrent network
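
The one-to-many mapping described in the abstract can be illustrated with a toy sampler. This is only a minimal sketch under stated assumptions, not the paper's architecture: the abstract does not specify how the latent variables or the attention module are parameterized, so the function `sample_prediction`, the inputs `clue_scores` and `theme_probs`, and the use of Gaussian noise on attention logits are all hypothetical choices made for illustration. The point it shows is that sampling a theme and then sampling attention weights gives a different focus on each draw for the same past video.

```python
import math
import random


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def sample_prediction(clue_scores, theme_probs, rng, noise_scale=1.0):
    """Draw one (theme, attention) pair for a single past video.

    clue_scores: hypothetical relevance logits, one per visual clue.
    theme_probs: hypothetical prior over event themes (sums to 1).
    """
    # Latent choice: which event theme this prediction is about.
    theme = rng.choices(range(len(theme_probs)), weights=theme_probs, k=1)[0]
    # Stochastic attention (assumed form): perturb the logits with Gaussian
    # noise before normalizing, so repeated calls focus on different clues.
    noisy = [s + rng.gauss(0.0, noise_scale) for s in clue_scores]
    attention = softmax(noisy)
    return theme, attention


rng = random.Random(0)
clue_scores = [2.0, 0.5, 1.0, -1.0]  # made-up scores for four visual clues
theme_probs = [0.6, 0.3, 0.1]        # made-up prior over three themes
samples = [sample_prediction(clue_scores, theme_probs, rng) for _ in range(5)]
for theme, attention in samples:
    print(theme, [round(a, 2) for a in attention])
```

In a full model, each sampled theme and attention vector would condition a caption decoder; here the samples are only printed to show that one input yields several distinct candidates.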