TTD: Text-Tag Self-Distillation Enhancing Image-Text Alignment in CLIP to Alleviate Single Tag Bias
arXiv (2024)
Abstract
We identify a critical bias in contemporary CLIP-based models, which we
denote as single tag bias. This bias manifests as a disproportionate
focus on a single tag (word) while neglecting other pertinent tags, stemming
from CLIP's text embeddings that prioritize one specific tag in image-text
relationships. When deconstructing text into individual tags, only one tag
tends to have high relevancy with CLIP's image embedding, leading to an
imbalanced tag relevancy. This results in an uneven alignment among multiple
tags present in the text. To tackle this challenge, we introduce a novel
two-step fine-tuning approach. First, our method leverages the similarity
between tags and their nearest pixels for scoring, enabling the extraction of
image-relevant tags from the text. Second, we present a self-distillation
strategy aimed at aligning the combined masks from extracted tags with the
text-derived mask. This approach mitigates the single tag bias, thereby
significantly improving CLIP's image-text alignment without requiring
additional data or supervision. Our technique demonstrates model-agnostic
improvements in multi-tag classification and segmentation tasks, surpassing
competing methods that rely on external resources. Code is available at
https://github.com/shjo-april/TTD.
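
The first step described above, scoring each tag by its similarity to its nearest pixel, can be illustrated with a minimal sketch. This is not the authors' implementation: the function names `score_tags` and `select_tags`, the use of cosine similarity with a maximum over pixels, the fixed threshold of 0.5, and the toy 3-dimensional embeddings are all assumptions made for illustration. In practice the embeddings would come from CLIP's image and text encoders.

```python
import numpy as np

def score_tags(pixel_emb: np.ndarray, tag_emb: np.ndarray) -> np.ndarray:
    """Score each tag by cosine similarity to its nearest pixel (assumed scoring rule).

    pixel_emb: (P, D) array of per-pixel image embeddings.
    tag_emb:   (T, D) array of per-tag text embeddings.
    Returns a (T,) array with one score per tag.
    """
    # L2-normalize so that the dot product equals cosine similarity.
    p = pixel_emb / np.linalg.norm(pixel_emb, axis=1, keepdims=True)
    t = tag_emb / np.linalg.norm(tag_emb, axis=1, keepdims=True)
    sim = t @ p.T            # (T, P) tag-to-pixel similarity matrix
    return sim.max(axis=1)   # keep only the most similar ("nearest") pixel per tag

def select_tags(tags, scores, threshold=0.5):
    """Keep tags whose score reaches a hypothetical threshold."""
    return [tag for tag, s in zip(tags, scores) if s >= threshold]

# Toy data: two pixels and three candidate tags in a 3-D embedding space.
pixels = np.array([[1.0, 0.0, 0.0],
                   [0.0, 1.0, 0.0]])
tag_vecs = np.array([[0.9, 0.1, 0.0],   # close to pixel 0
                     [0.0, 1.0, 0.0],   # identical to pixel 1
                     [0.0, 0.0, 1.0]])  # orthogonal to every pixel
tags = ["dog", "grass", "piano"]

scores = score_tags(pixels, tag_vecs)
kept = select_tags(tags, scores)
print(kept)
```

In this toy case, "dog" and "grass" each have a well-matching pixel and are kept, while "piano" matches no pixel and is dropped. Taking the maximum over pixels, rather than comparing against a single global image embedding, lets each tag be judged on its own best-matching region, which is the property the abstract relies on to avoid one tag dominating.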