Interaction-Integrated Network for Natural Language Moment Localization

IEEE TRANSACTIONS ON IMAGE PROCESSING (2021)

Citations: 34 | Views: 413
Abstract
Natural language moment localization aims at localizing video clips according to a natural language description. The key to this challenging task lies in modeling the relationship between verbal descriptions and visual contents. Existing approaches often sample a number of clips from the video and individually determine how each of them is related to the query sentence. However, this strategy can fail dramatically, in particular when the query sentence refers to visual elements that appear outside of, or even distant from, the target clip. In this paper, we address this issue by designing an Interaction-Integrated Network (I²N), which contains a few Interaction-Integrated Cells (I²Cs). The idea stems from the observation that the query sentence not only describes the target video clip but also contains semantic cues about the structure of the entire video. Based on this, I²Cs go one step beyond modeling short-term contexts in the time domain by encoding long-term video content into every frame feature. By stacking a few I²Cs, the resulting network, I²N, enjoys improved inference ability, brought by both (I) multi-level correspondence between vision and language and (II) more accurate cross-modal alignment. When evaluated on a challenging video moment localization dataset named DiDeMo, I²N outperforms the state-of-the-art approach by a clear margin of 1.98%. On two other challenging datasets, Charades-STA and TACoS, I²N also reports competitive performance.
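
To make the cell idea concrete, here is a minimal sketch of how a block like an I²C might fuse query-conditioned long-term context and short-term temporal context into every frame feature. This is an illustrative PyTorch interpretation under stated assumptions, not the authors' released implementation: the class name InteractionIntegratedCell, the convolution/attention split, and all dimensions are hypothetical.

# Hypothetical sketch of an interaction-integrated cell; names and
# design details are assumptions, not the paper's exact architecture.
import torch
import torch.nn as nn
import torch.nn.functional as F

class InteractionIntegratedCell(nn.Module):
    """Injects short-term and query-conditioned long-term context
    into each per-frame feature, with a residual connection so that
    several cells can be stacked."""

    def __init__(self, dim: int):
        super().__init__()
        self.local = nn.Conv1d(dim, dim, kernel_size=3, padding=1)  # short-term context
        self.query_proj = nn.Linear(dim, dim)
        self.frame_proj = nn.Linear(dim, dim)
        self.fuse = nn.Linear(3 * dim, dim)

    def forward(self, frames: torch.Tensor, query: torch.Tensor) -> torch.Tensor:
        # frames: (B, T, D) per-frame features; query: (B, D) sentence feature
        local = self.local(frames.transpose(1, 2)).transpose(1, 2)   # (B, T, D)
        # Query-conditioned attention over all frames -> long-term context.
        scores = torch.einsum('bd,btd->bt', self.query_proj(query),
                              self.frame_proj(frames))
        attn = F.softmax(scores / frames.size(-1) ** 0.5, dim=-1)    # (B, T)
        global_ctx = torch.einsum('bt,btd->bd', attn, frames)        # (B, D)
        global_ctx = global_ctx.unsqueeze(1).expand_as(frames)       # broadcast to every frame
        fused = self.fuse(torch.cat([frames, local, global_ctx], dim=-1))
        return frames + torch.relu(fused)  # residual: cells stack cleanly

# Stacking a few cells mirrors the multi-level correspondence idea.
cells = nn.ModuleList(InteractionIntegratedCell(256) for _ in range(3))
frames, query = torch.randn(2, 64, 256), torch.randn(2, 256)
for cell in cells:
    frames = cell(frames, query)
print(frames.shape)  # torch.Size([2, 64, 256])

The residual formulation is one plausible way to realize "encoding long-term video content into every frame feature": each pass enriches the frame representation without discarding the original features, so deeper stacks can refine cross-modal alignment level by level.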
Keywords
Visualization, Semantics, Location awareness, Task analysis, Linguistics, Convolution, Data models, Temporal action localization, cross-modal learning, vision-language understanding