Video Captioning Based on the Spatial-Temporal Saliency Tracing.

ADVANCES IN MULTIMEDIA INFORMATION PROCESSING, PT I (2018)

Abstract
Video captioning is a crucial task for video understanding and has attracted much attention recently. The regions of interest (ROIs) of a video contain the information most interesting to the audience. Unlike the ROIs of images, the ROIs of videos have the property of temporal continuity (e.g., a moving object or an action in a video clip), and this continuity is what draws people's attention. Inspired by this insight, we propose an approach that automatically traces the spatial-temporal saliency content for video captioning by capturing the temporal structure of ROI candidates. To this end, we employ a set of modules named tracing LSTMs, each of which traces a single ROI candidate of the feature maps across the entire video. The temporal structures of the global features and the ROI features are combined to obtain a rough understanding of the video content together with the ROI information, and the combined representation is set as the initial states of the decoder that generates captions. We verify the effectiveness of our method on a public benchmark, the Microsoft Video Description Corpus (MSVD). The experimental results demonstrate that capturing temporal ROI information with tracing LSTMs enhances the representation of input videos and achieves state-of-the-art results.
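
The encoder outlined in the abstract can be illustrated with a minimal sketch in plain Python. It is not the authors' implementation. It assumes that each spatial position of a per-frame feature map is one ROI candidate, that every tracing LSTM has its own weights, and that the global and ROI summaries are combined by mean pooling and concatenation; the abstract does not specify any of these details. The names LSTMCell, mean_vectors, and encode_video are hypothetical, and the weights are random and untrained, so only the data flow is meaningful.

```python
import math
import random


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def mean_vectors(vectors):
    """Element-wise mean of a list of equal-length vectors."""
    return [sum(column) / len(vectors) for column in zip(*vectors)]


class LSTMCell:
    """A standard LSTM cell on plain lists (hypothetical helper, untrained)."""

    def __init__(self, input_size, hidden_size, rng):
        self.hidden_size = hidden_size
        n = input_size + hidden_size
        # Rows are grouped by gate: input, forget, output, candidate.
        self.W = [[rng.uniform(-0.1, 0.1) for _ in range(n)]
                  for _ in range(4 * hidden_size)]
        self.b = [0.0] * (4 * hidden_size)

    def step(self, x, h, c):
        z = x + h  # list concatenation of the input and the previous hidden state
        pre = [sum(w * v for w, v in zip(row, z)) + bias
               for row, bias in zip(self.W, self.b)]
        H = self.hidden_size
        i = [sigmoid(p) for p in pre[0:H]]
        f = [sigmoid(p) for p in pre[H:2 * H]]
        o = [sigmoid(p) for p in pre[2 * H:3 * H]]
        g = [math.tanh(p) for p in pre[3 * H:4 * H]]
        c_new = [fk * ck + ik * gk for fk, ck, ik, gk in zip(f, c, i, g)]
        h_new = [ok * math.tanh(ck) for ok, ck in zip(o, c_new)]
        return h_new, c_new

    def run(self, sequence):
        """Process a whole sequence and return the final hidden state."""
        h = [0.0] * self.hidden_size
        c = [0.0] * self.hidden_size
        for x in sequence:
            h, c = self.step(x, h, c)
        return h


def encode_video(video, tracers, global_lstm):
    """video[t][r] is the feature vector of ROI candidate r in frame t."""
    num_regions = len(video[0])
    # Each tracing LSTM follows one ROI candidate across all frames.
    roi_states = [tracers[r].run([frame[r] for frame in video])
                  for r in range(num_regions)]
    # Global branch: mean-pool the regions of each frame (an assumption).
    global_state = global_lstm.run([mean_vectors(frame) for frame in video])
    # Combination by mean pooling plus concatenation (an assumption).
    return global_state + mean_vectors(roi_states)


rng = random.Random(0)
T, R, D, H = 5, 4, 3, 2  # frames, ROI candidates, feature size, hidden size
video = [[[rng.uniform(-1, 1) for _ in range(D)] for _ in range(R)]
         for _ in range(T)]
tracers = [LSTMCell(D, H, rng) for _ in range(R)]
global_lstm = LSTMCell(D, H, rng)

init_state = encode_video(video, tracers, global_lstm)
print(len(init_state))  # 4, i.e. 2 * H: global summary plus ROI summary
```

In a full captioning model, the vector returned by encode_video would initialize the decoder, which then generates the caption word by word.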
Keywords
Video captioning, Regions-of-Interest (ROI), Spatial-Temporal saliency, Tracing LSTM