Balanced Similarity with Auxiliary Prompts: Towards Alleviating Text-to-Image Retrieval Bias for CLIP in Zero-shot Learning

CoRR(2024)

Abstract
CLIP aligns texts and images and is one of the most widely used foundation models in cross-modal zero-shot learning. However, our experimental findings reveal that CLIP suffers from a bias in text-to-image retrieval, which degrades its zero-shot learning performance. Our analysis shows that the bias partly arises from the imbalanced range of similarity scores produced by CLIP. Accordingly, we propose Balanced Similarity with Auxiliary Prompts (BSAP) to mitigate the text-to-image retrieval bias of CLIP. Specifically, BSAP designs auxiliary prompts so that CLIP computes multiple similarity scores for each candidate image, and then normalizes the scores between each image and the given query text together with the auxiliary prompts to obtain balanced similarity scores. The balanced similarity score of the given query text is used for the final retrieval. In addition, we adopt a hybrid similarity that combines BSAP with the original CLIP similarity to obtain a more robust outcome. Extensive experiments on two typical zero-shot learning tasks, i.e., Referring Expression Comprehension (REC) and Referring Image Segmentation (RIS), demonstrate the effectiveness of BSAP. Specifically, when using the val dataset of RefCOCO in REC, BSAP increases CLIP's performance by 20.6
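
The balancing step described in the abstract can be illustrated with a small sketch. The abstract does not give the exact normalization or the way the hybrid similarity is combined, so the code below makes two labeled assumptions: a softmax over each image's scores (query text plus auxiliary prompts), and a linear blend with a hypothetical weight `alpha`. The function names and toy similarity values are also illustrative and are not taken from the paper.

```python
import math


def softmax(scores, temperature=1.0):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp((s - peak) / temperature) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def balanced_scores(query_sims, aux_sims, temperature=1.0):
    """For each image, normalize its query score against its auxiliary-prompt scores.

    query_sims[i] is the similarity between image i and the query text.
    aux_sims[i] lists the similarities between image i and each auxiliary prompt.
    ASSUMPTION: softmax is used as the normalization; the abstract only says
    the scores are "normalized".
    """
    balanced = []
    for query_score, aux in zip(query_sims, aux_sims):
        normalized = softmax([query_score] + list(aux), temperature)
        balanced.append(normalized[0])  # keep only the query's share
    return balanced


def hybrid_scores(query_sims, balanced, alpha=0.5):
    """ASSUMPTION: linear blend of balanced and original similarity."""
    return [alpha * b + (1.0 - alpha) * q for q, b in zip(query_sims, balanced)]


def retrieve(scores):
    """Index of the highest-scoring image."""
    return max(range(len(scores)), key=lambda i: scores[i])


# Toy values: image 0 scores high against every prompt, so its raw query
# score wins even though the query is not distinctive for it.
query_sims = [0.30, 0.28]
aux_sims = [[0.32, 0.31], [0.20, 0.19]]

raw_choice = retrieve(query_sims)
balanced_choice = retrieve(balanced_scores(query_sims, aux_sims))
```

In this toy case the raw similarity selects image 0, while the per-image normalization selects image 1, because the query score of image 1 stands out from its auxiliary-prompt scores. This only illustrates the kind of range imbalance the abstract refers to; it is not a reproduction of the reported results.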