An Attentive Sequence to Sequence Translator for Localizing Video Clips by Natural Language

IEEE Transactions on Multimedia (2020)

Abstract
We propose a novel attentive sequence-to-sequence translator (ASST) for localizing video clips by natural language descriptions. We make two contributions. First, we propose an attentive mechanism that aligns natural language descriptions and video content. A bi-directional Recurrent Neural Network (RNN) parses natural language descriptions in two directions. Given a video-description pair, ASST generates a vector sequence representation. Each vector represents a video frame, conditioned on the description. The vector sequence representation not only preserves the temporal dependencies between frames, but also provides an effective way to perform frame-level video-language matching. The attentive model then aligns words to each frame, resulting in a more detailed understanding of video content and description semantics. Second, we design a hierarchical architecture for the network to jointly model language descriptions and video content. The hierarchical architecture exploits video content at multiple granularities, ranging from subtle details to global context. The integration of the multiple granularities yields a robust representation for multi-level video-language abstraction. We validate the effectiveness of our ASST on two large-scale datasets. Our ASST outperforms the state-of-the-art by 4.28% in Rank@1 on the DiDeMo dataset. On the Charades-STA dataset, we significantly improve the state-of-the-art by 13.41% in Recall@1, IoU = 0.5.
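To make the frame-level video-language matching concrete, below is a minimal PyTorch sketch of the idea described in the abstract, not the authors' implementation: a bi-directional RNN encodes the description, each frame feature attends over the word states, and the attended language context is fused with the frame feature to yield one description-conditioned vector per frame. The module name `FrameWordAttention`, all dimensions, and the fusion step are illustrative assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FrameWordAttention(nn.Module):
    """Illustrative sketch (assumed, not the paper's code): produce
    description-conditioned frame vectors via word-to-frame attention."""

    def __init__(self, word_dim=300, frame_dim=512, hidden=256):
        super().__init__()
        # Bi-directional RNN parses the description in two directions.
        self.rnn = nn.GRU(word_dim, hidden, batch_first=True,
                          bidirectional=True)
        self.frame_proj = nn.Linear(frame_dim, 2 * hidden)
        self.fuse = nn.Linear(4 * hidden, 2 * hidden)

    def forward(self, words, frames):
        # words:  (B, L, word_dim)  word embeddings of the description
        # frames: (B, T, frame_dim) per-frame visual features
        word_states, _ = self.rnn(words)                     # (B, L, 2H)
        f = self.frame_proj(frames)                          # (B, T, 2H)
        # Attention scores align every word to every frame.
        scores = torch.bmm(f, word_states.transpose(1, 2))  # (B, T, L)
        alpha = F.softmax(scores, dim=-1)
        attended = torch.bmm(alpha, word_states)             # (B, T, 2H)
        # Fuse attended language context with each frame feature.
        out = torch.tanh(self.fuse(torch.cat([f, attended], dim=-1)))
        return out  # (B, T, 2H): one conditioned vector per frame

# Example: 2 clips of 16 frames, descriptions of 12 words.
model = FrameWordAttention()
words = torch.randn(2, 12, 300)
frames = torch.randn(2, 16, 512)
print(model(words, frames).shape)  # torch.Size([2, 16, 512])
```

The resulting vector sequence keeps one vector per frame, so temporal dependencies survive and frame-level matching reduces to operations on this sequence; the paper's hierarchical architecture would then aggregate such sequences at multiple temporal granularities.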
Keywords
Temporal action localization, sequence-to-sequence learning, natural language guided detection