CLoVe: Encoding Compositional Language in Contrastive Vision-Language Models
CoRR (2024)
Abstract
Recent years have witnessed a significant increase in the performance of
Vision and Language tasks. Foundational Vision-Language Models (VLMs), such as
CLIP, have been leveraged in multiple settings and demonstrated remarkable
performance across several tasks. Such models excel at object-centric
recognition yet learn text representations that seem invariant to word order,
failing to compose known concepts in novel ways. However, no evidence exists
that any VLM, including large-scale single-stream models such as GPT-4V,
identifies compositions successfully. In this paper, we introduce a framework
to significantly improve the ability of existing models to encode compositional
language, with over 10
while maintaining or improving the performance on standard object-recognition
and retrieval benchmarks. Our code and pre-trained models are publicly
available at https://github.com/netflix/clove.