Balanced Similarity with Auxiliary Prompts: Towards Alleviating Text-to-Image Retrieval Bias for CLIP in Zero-shot Learning
CoRR (2024)
Abstract
CLIP aligns texts and images and is among the most frequently used
foundation models in cross-modal zero-shot learning. However,
our experimental findings reveal that CLIP suffers from a bias in text-to-image
retrieval, resulting in a decrease in CLIP's zero-shot learning performance. We
analytically discover that the bias partly arises from the imbalanced range of
similarity scores obtained by CLIP. Accordingly, we propose Balanced
Similarity with Auxiliary Prompts (BSAP) to mitigate the text-to-image
retrieval bias of CLIP. Specifically, our BSAP designs auxiliary prompts for
CLIP to calculate multiple similarity scores for the retrieval images and then
normalizes the scores between each image and the given query text as well as
our auxiliary prompts to obtain balanced similarity scores. The balanced
similarity score of the given query text is used for the final retrieval. In
addition, we attempt to adopt a hybrid similarity that combines our BSAP with
the original similarity of CLIP to obtain a more robust outcome. Extensive
experiments on two typical zero-shot learning tasks, i.e., Referring Expression
Comprehension (REC) and Referring Image Segmentation (RIS), are conducted to
demonstrate the effectiveness of our BSAP. Specifically, when using the val
dataset of RefCOCO in REC, BSAP increases CLIP's performance by 20.6
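
The balancing step described in the abstract can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the abstract does not specify the normalization, so a softmax over the query text and auxiliary prompts is assumed here, and the hybrid score is assumed to be a simple weighted sum. The function names, the `temperature` and `alpha` parameters, and the toy scores are all hypothetical.

```python
import numpy as np


def balanced_similarity(sim, temperature=1.0):
    """Per-image normalized score of the query text.

    sim: array of shape (num_images, 1 + num_aux). Column 0 holds the
    CLIP similarity between each image and the query text; the remaining
    columns hold similarities to the auxiliary prompts.
    ASSUMPTION: the normalization is a softmax across prompts per image.
    """
    scaled = np.asarray(sim, dtype=float) / temperature
    # Subtract the row maximum for numerical stability.
    scaled = scaled - scaled.max(axis=1, keepdims=True)
    exp = np.exp(scaled)
    probs = exp / exp.sum(axis=1, keepdims=True)
    return probs[:, 0]


def hybrid_similarity(sim, alpha=0.5, temperature=1.0):
    """ASSUMPTION: hybrid score = weighted sum of balanced and raw scores."""
    sim = np.asarray(sim, dtype=float)
    balanced = balanced_similarity(sim, temperature)
    return alpha * balanced + (1.0 - alpha) * sim[:, 0]


# Toy scores (made up): image 0 scores high against every prompt,
# image 1 scores high only against the query text.
toy_sim = np.array([
    [0.30, 0.31, 0.32],
    [0.28, 0.20, 0.18],
])

raw_choice = int(np.argmax(toy_sim[:, 0]))
balanced_choice = int(np.argmax(balanced_similarity(toy_sim)))
print(raw_choice, balanced_choice)
```

In this toy case the raw query score prefers the image whose scores are uniformly high, while the normalized score prefers the image that matches the query text specifically, which is the kind of range imbalance the abstract attributes the retrieval bias to.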