Spatial-Temporal Graphs for Cross-Modal Text2Video Retrieval

IEEE TRANSACTIONS ON MULTIMEDIA (2022)

Citations: 48 | Views: 69
Abstract
Cross-modal text-to-video retrieval aims to find relevant videos given text queries, which is crucial for many real-world applications. The key to addressing this task is to build the correspondence between video and text so that related samples from different modalities can be aligned. As the text (sentence) contains both nouns and verbs representing objects as well as their interactions, retrieving relevant videos requires a fine-grained understanding of video content: not only the semantic concepts (i.e., objects) but also the interactions between them. Nevertheless, current approaches mostly represent videos with aggregated frame-level features for learning the joint space and ignore object interactions, which usually results in suboptimal retrieval performance. To improve cross-modal video retrieval, this paper proposes a framework that models videos as spatial-temporal graphs, where nodes correspond to visual objects and edges correspond to the relations/interactions between objects. With the spatial-temporal graphs, object interactions in frame sequences can be captured to enrich the video representations for joint space learning. Specifically, a Graph Convolutional Network is introduced to learn representations on the spatial-temporal graphs, aiming to encode spatial-temporal interactions between objects, while BERT is introduced to dynamically encode the sentence according to its context for cross-modal retrieval. Extensive experiments verify the effectiveness of the proposed framework, which achieves promising performance on both the MSR-VTT and LSMDC datasets.
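As a concrete illustration of the architecture the abstract describes, the sketch below pairs a simple graph-convolution encoder over object nodes with a BERT sentence encoder, projecting both into a shared joint space for similarity-based retrieval. This is a minimal sketch, not the authors' implementation: it assumes PyTorch and HuggingFace Transformers, uses random placeholder features in place of a real object detector, stands in an identity matrix for the spatial-temporal adjacency, and the layer sizes and cosine-similarity ranking are illustrative choices.

```python
# Minimal sketch (not the paper's code): GCN over object nodes of a
# spatial-temporal graph + BERT sentence encoder, both mapped to a joint space.
import torch
import torch.nn as nn
from transformers import BertModel, BertTokenizer


class GCNLayer(nn.Module):
    """One graph-convolution step: H' = ReLU(A_hat @ H @ W)."""
    def __init__(self, in_dim, out_dim):
        super().__init__()
        self.linear = nn.Linear(in_dim, out_dim)

    def forward(self, feats, adj):
        # feats: (num_nodes, in_dim); adj: (num_nodes, num_nodes), row-normalized
        return torch.relu(self.linear(adj @ feats))


class VideoGraphEncoder(nn.Module):
    """Two GCN layers over object nodes, then mean pooling to a video embedding."""
    def __init__(self, in_dim=2048, hid_dim=512, joint_dim=256):
        super().__init__()
        self.gcn1 = GCNLayer(in_dim, hid_dim)
        self.gcn2 = GCNLayer(hid_dim, joint_dim)

    def forward(self, feats, adj):
        h = self.gcn2(self.gcn1(feats, adj), adj)
        return h.mean(dim=0)  # video embedding in the joint space


class TextEncoder(nn.Module):
    """BERT [CLS] embedding projected into the same joint space."""
    def __init__(self, joint_dim=256):
        super().__init__()
        self.bert = BertModel.from_pretrained("bert-base-uncased")
        self.proj = nn.Linear(self.bert.config.hidden_size, joint_dim)

    def forward(self, input_ids, attention_mask):
        out = self.bert(input_ids=input_ids, attention_mask=attention_mask)
        return self.proj(out.last_hidden_state[:, 0])  # (batch, joint_dim)


if __name__ == "__main__":
    # 10 object nodes with 2048-d detector features (random placeholders here)
    feats = torch.randn(10, 2048)
    adj = torch.eye(10)  # identity adjacency as a stand-in for spatial-temporal links
    video_emb = VideoGraphEncoder()(feats, adj)

    tok = BertTokenizer.from_pretrained("bert-base-uncased")
    enc = tok(["a dog chases a ball"], return_tensors="pt")
    text_emb = TextEncoder()(enc["input_ids"], enc["attention_mask"])

    # Cosine similarity in the joint space ranks videos against the text query.
    print(torch.cosine_similarity(video_emb.unsqueeze(0), text_emb))
```

In a full system, one would build the adjacency from detected objects (spatial edges within a frame, temporal edges across frames) and train both encoders jointly with a ranking loss over matched and mismatched video-text pairs.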
Keywords
Visualization, Semantics, Bit error rate, Encoding, Task analysis, Feature extraction, Microphones, Cross-modal retrieval, video retrieval, spatial-temporal graphs, cross-modal learning