Many Hands Make Light Work: Transferring Knowledge From Auxiliary Tasks for Video-Text Retrieval.

Abstract
The problem of video-text retrieval, which searches videos via natural language descriptions or vice versa, has attracted growing attention due to the explosive number of videos produced every day. The dominant approaches follow a pipeline that first learns compact feature representations of videos and texts and then jointly embeds them into a common feature space where matched video-text pairs are close and unmatched pairs are far apart. However, most of them neither consider the structural similarities among cross-modal samples from a global view nor leverage useful information from other relevant retrieval processes. We argue that both kinds of information have great potential for video-text retrieval. In this paper, we treat the relevant retrieval processes as auxiliary tasks and extract useful knowledge from them by exploiting structural similarities via Graph Neural Networks (GNNs). We then progressively transfer the knowledge from the auxiliary tasks in a general-to-specific manner to assist the main task of the current retrieval process. Specifically, for a given query, we first construct a sequence of query-graphs whose central queries are chosen in order from distant to close to the given query. We then conduct knowledge-guided message passing in each query-graph to exploit regional structural similarities, and we gather knowledge of different levels from the updated query-graphs with a knowledge-based attention mechanism. Finally, we transfer the extracted knowledge from general to specific to assist the current retrieval process. Extensive experimental results show that our model outperforms state-of-the-art methods on four benchmarks.
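To make the general-to-specific flow described above concrete, the following is a minimal PyTorch sketch of the idea: build a sequence of small query-graphs whose central queries range from distant to close to the current query, run one round of message passing in each graph, pool each graph's knowledge with query-conditioned attention, and progressively fuse it into the query representation. All module and function names (QueryGraphLayer, build_query_graphs, transfer_knowledge), the graph construction heuristic, and the hyperparameters are illustrative assumptions, not the authors' implementation.

```python
# Illustrative sketch only: untrained layers, simplified graph construction.
import torch
import torch.nn as nn
import torch.nn.functional as F


class QueryGraphLayer(nn.Module):
    """One round of message passing over a fully connected query-graph."""

    def __init__(self, dim):
        super().__init__()
        self.msg = nn.Linear(dim, dim)
        self.upd = nn.Linear(2 * dim, dim)

    def forward(self, nodes):                        # nodes: (n, dim)
        sim = F.softmax(nodes @ nodes.t(), dim=-1)   # edge weights from node similarity
        messages = sim @ self.msg(nodes)             # aggregate neighbour messages
        return torch.tanh(self.upd(torch.cat([nodes, messages], dim=-1)))


def build_query_graphs(query, aux_queries, num_graphs=3, k=4):
    """Pick central queries ordered from distant to close to `query`, and return
    one small graph (central query plus its k nearest auxiliaries) per level."""
    dist = torch.cdist(query.unsqueeze(0), aux_queries).squeeze(0)      # (m,)
    order = torch.argsort(dist, descending=True)                        # distant -> close
    centers = order[torch.linspace(0, len(order) - 1, num_graphs).long()]
    graphs = []
    for c in centers:
        nearest = torch.topk(-torch.cdist(aux_queries[c:c + 1], aux_queries).squeeze(0), k)
        graphs.append(aux_queries[nearest.indices])                     # (k, dim) node features
    return graphs


def transfer_knowledge(query, aux_queries, dim=256):
    """General-to-specific transfer: message passing in each query-graph,
    knowledge-based attention pooling, then progressive fusion into the query."""
    gnn = QueryGraphLayer(dim)                       # untrained, for illustration only
    fused = query
    for nodes in build_query_graphs(query, aux_queries):    # general -> specific
        updated = gnn(nodes)
        attn = F.softmax(updated @ fused, dim=0)             # query-conditioned attention
        knowledge = (attn.unsqueeze(-1) * updated).sum(0)
        fused = F.normalize(fused + knowledge, dim=-1)       # progressive fusion
    return fused


if __name__ == "__main__":
    q = F.normalize(torch.randn(256), dim=-1)          # current query embedding
    aux = F.normalize(torch.randn(100, 256), dim=-1)   # queries from relevant retrieval processes
    print(transfer_knowledge(q, aux).shape)            # torch.Size([256])
```

The fused query embedding would then replace the original one when ranking candidate videos (or texts) in the shared embedding space; in the paper's pipeline the graph layers and attention are learned jointly with the retrieval objective rather than left untrained as in this sketch.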
Keywords
Auxiliary tasks, graph neural networks, knowledge transfer, video-text retrieval