Trust Your Partner’s Friends: Hierarchical Cross-Modal Contrastive Pre-Training for Video-Text Retrieval

ICASSP 2023 - 2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2023

Abstract
Video-text retrieval has benefited greatly from massive web video data in recent years, but its performance is still limited by the weak supervision that uncurated data provides. In this work, we propose to leverage the well-represented information of each original modality and to exploit complementary information in two views of the same video, i.e., video clips and captions, by using one view to obtain positive samples with the neighboring samples of the other. Respecting the hierarchical organization of real-world data, we further design a hierarchical cross-modal pre-training method (HCP) to learn good representations in the common embedding space. We evaluate the pre-trained model on three downstream tasks, i.e., text-to-video retrieval, action step localization, and video question answering, and our method outperforms previous works under the same setting.
Keywords
Video-Text Retrieval, Cross-Modal Retrieval, Vision Language Tasks, Contrastive Learning
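
The abstract's idea of using one view to obtain positive samples from the neighbors in the other view can be illustrated with a small NumPy sketch. This is only one possible reading, not the authors' implementation: it assumes that, for a video anchor, the positives are its paired caption plus the k nearest neighbors of that caption in a unimodal text feature space, and that the loss is a multi-positive InfoNCE. The function names, `k`, and `temperature` are hypothetical, and the hierarchical part of HCP is not modeled.

```python
import numpy as np

def l2_normalize(x, eps=1e-8):
    # Row-wise L2 normalization so dot products become cosine similarities.
    return x / (np.linalg.norm(x, axis=1, keepdims=True) + eps)

def neighbor_positive_mask(unimodal_feats, k):
    # mask[i, j] is True if j == i or j is one of the k nearest neighbors
    # of sample i (cosine similarity) in a single-modality feature space.
    z = l2_normalize(unimodal_feats)
    sim = z @ z.T
    np.fill_diagonal(sim, -np.inf)            # never pick a sample as its own neighbor
    n = sim.shape[0]
    idx = np.argsort(-sim, axis=1)[:, :k]     # indices of the top-k most similar samples
    mask = np.eye(n, dtype=bool)              # the paired sample is always a positive
    mask[np.arange(n)[:, None], idx] = True
    return mask

def multi_positive_info_nce(anchor_emb, target_emb, pos_mask, temperature=0.07):
    # Cross-modal contrastive loss where each anchor may have several positives.
    a = l2_normalize(anchor_emb)
    t = l2_normalize(target_emb)
    logits = a @ t.T / temperature
    logits = logits - logits.max(axis=1, keepdims=True)   # numerical stability
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    # Average the negative log-likelihood over each anchor's positives.
    per_anchor = -(log_prob * pos_mask).sum(axis=1) / pos_mask.sum(axis=1)
    return per_anchor.mean()

# Toy data: 8 video/caption pairs (random values, for illustration only).
rng = np.random.default_rng(0)
video_emb = rng.normal(size=(8, 16))      # video embeddings in the common space
text_emb = rng.normal(size=(8, 16))       # caption embeddings in the common space
text_unimodal = rng.normal(size=(8, 32))  # assumed text-only features used to find neighbors

# Positives for each video: its caption plus that caption's 2 nearest captions.
mask = neighbor_positive_mask(text_unimodal, k=2)
loss_v2t = multi_positive_info_nce(video_emb, text_emb, mask)
```

Under the same assumptions, the symmetric text-to-video term would build the mask from unimodal video features and swap the anchor and target arguments. With `k=0` the mask reduces to the identity and the loss becomes the standard one-positive InfoNCE.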