Composed Video Retrieval via Enriched Context and Discriminative Embeddings
arXiv (2024)
Abstract
Composed video retrieval (CoVR) is a challenging problem in computer vision
which has recently highlighted the integration of modification text with visual
queries for more sophisticated video search in large databases. Existing works
predominantly rely on visual queries combined with modification text to
distinguish relevant videos. However, such a strategy struggles to fully
preserve the rich query-specific context in retrieved target videos and
represents the target video using only a visual embedding. We introduce a novel CoVR
framework that leverages detailed language descriptions to explicitly encode
query-specific contextual information and learns discriminative embeddings of
vision only, text only, and vision-text for better alignment to accurately
retrieve matched target videos. Our proposed framework can be flexibly employed
for both composed video (CoVR) and image (CoIR) retrieval tasks. Experiments on
three datasets show that our approach obtains state-of-the-art performance for
both CoVR and zero-shot CoIR tasks, achieving gains as high as around 7% in
terms of recall@K=1 score. Our code, models, and detailed language descriptions for
the WebVid-CoVR dataset are available at