MagicLens: Self-Supervised Image Retrieval with Open-Ended Instructions
arXiv (2024)
Abstract
Image retrieval, i.e., finding desired images given a reference image,
inherently encompasses rich, multi-faceted search intents that are difficult to
capture solely using image-based measures. Recent work leverages text
instructions to allow users to more freely express their search intents.
However, existing work primarily focuses on image pairs that are visually
similar and/or can be characterized by a small set of pre-defined relations.
The core thesis of this paper is that text instructions can enable retrieving
images with richer relations beyond visual similarity. To show this, we
introduce MagicLens, a series of self-supervised image retrieval models that
support open-ended instructions. MagicLens is built on a key novel insight:
image pairs that naturally occur on the same web pages contain a wide range of
implicit relations (e.g., inside view of), and we can make those implicit
relations explicit by synthesizing instructions via large multimodal models
(LMMs) and large language models (LLMs). Trained on 36.7M (query image,
instruction, target image) triplets with rich semantic relations mined from the
web, MagicLens achieves results comparable to or better than prior
state-of-the-art (SOTA) methods on eight benchmarks covering various image
retrieval tasks. Remarkably, it outperforms the previous SOTA on multiple
benchmarks while using a 50X smaller model. Additional human analyses on a
1.4M-image unseen corpus further demonstrate the diversity of search intents
supported by MagicLens.
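The self-supervised setup described above — pairing a (query image, instruction) with a target image mined from the same web page — can be sketched as a standard in-batch contrastive (InfoNCE-style) objective. The following is a minimal NumPy illustration under stated assumptions, not the actual MagicLens architecture: the averaging fusion in `embed_query` and all function names here are hypothetical stand-ins for the paper's learned encoders.

```python
import numpy as np

def embed_query(image_vec, instruction_vec):
    # Hypothetical fusion: MagicLens jointly encodes the query image and
    # the text instruction; here we simply average two (assumed)
    # pre-extracted feature vectors and L2-normalize as a stand-in.
    fused = (image_vec + instruction_vec) / 2.0
    return fused / np.linalg.norm(fused)

def info_nce_loss(query_embs, target_embs, temperature=0.07):
    # In-batch contrastive loss: each (query, target) pair mined from the
    # same web page is a positive; other targets in the batch act as
    # negatives.
    logits = query_embs @ target_embs.T / temperature
    logits -= logits.max(axis=1, keepdims=True)  # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_probs))

rng = np.random.default_rng(0)
batch, dim = 4, 8
images = rng.normal(size=(batch, dim))
instructions = rng.normal(size=(batch, dim))
targets = rng.normal(size=(batch, dim))
targets /= np.linalg.norm(targets, axis=1, keepdims=True)

queries = np.stack([embed_query(i, t) for i, t in zip(images, instructions)])
loss = info_nce_loss(queries, targets)
print(f"contrastive loss: {loss:.4f}")
```

At retrieval time the same query embedding is compared against a corpus of target-image embeddings by inner product, so the triplet objective and the retrieval scoring function coincide.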