MeaCap: Memory-Augmented Zero-shot Image Captioning
CVPR 2024
Abstract
Zero-shot image captioning (IC) without well-paired image-text data can be
divided into two categories: training-free and text-only-training. Generally,
these two types of methods realize zero-shot IC by integrating a pre-trained
vision-language model such as CLIP for image-text similarity evaluation and a
pre-trained language model (LM) for caption generation. The main difference
between them is whether a textual corpus is used to train the LM. Though
achieving attractive performance w.r.t. some metrics, existing methods often
exhibit some common drawbacks. Training-free methods tend to produce
hallucinations, while text-only-training methods often lose generalization
capability.
To move forward, in this paper, we propose a novel Memory-Augmented zero-shot
image Captioning framework (MeaCap). Specifically, equipped with a textual
memory, we introduce a retrieve-then-filter module to get key concepts that are
highly related to the image. By deploying our proposed memory-augmented
visual-related fusion score in a keywords-to-sentence LM, MeaCap can generate
concept-centered captions that stay highly consistent with the image, with fewer
hallucinations and more world knowledge. MeaCap achieves
state-of-the-art performance on a series of zero-shot IC settings. Our code is
available at https://github.com/joeyz0z/MeaCap.
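
The retrieve-then-filter idea described above can be illustrated with a toy sketch. This is not the authors' implementation: the function name `retrieve_then_filter`, the hand-written 2-D vectors, the word-level notion of a "concept", and the `top_k` and `threshold` parameters are all assumptions made for illustration. In practice the embeddings would presumably come from a vision-language model such as CLIP, and the actual concept extraction and filtering rules are those in the linked repository.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve_then_filter(image_vec, memory, concept_vecs, top_k=2, threshold=0.8):
    """Hypothetical sketch: retrieve similar captions, then filter concepts.

    memory       -- list of (caption, embedding) pairs (the textual memory)
    concept_vecs -- dict mapping a candidate concept word to its embedding
    """
    # Retrieve: rank memory captions by similarity to the image embedding.
    ranked = sorted(memory, key=lambda item: cosine(image_vec, item[1]), reverse=True)
    retrieved = [caption for caption, _ in ranked[:top_k]]

    # Filter: keep only concepts from the retrieved captions that are
    # themselves similar enough to the image (assumed filtering rule).
    concepts = []
    for caption in retrieved:
        for word in caption.lower().split():
            if word in concept_vecs and word not in concepts:
                if cosine(image_vec, concept_vecs[word]) >= threshold:
                    concepts.append(word)
    return retrieved, concepts


# Toy data: made-up 2-D embeddings standing in for real model features.
image_vec = [1.0, 0.0]
memory = [
    ("a dog runs on grass", [0.9, 0.1]),
    ("a cat sleeps on a sofa", [0.1, 0.9]),
    ("a dog catches a frisbee", [0.8, 0.3]),
]
concept_vecs = {
    "dog": [1.0, 0.1],
    "grass": [0.7, 0.4],
    "frisbee": [0.6, 0.5],
    "cat": [0.0, 1.0],
    "sofa": [0.1, 1.0],
}

retrieved, concepts = retrieve_then_filter(image_vec, memory, concept_vecs)
print(retrieved)
print(concepts)
```

In this toy setting the two dog captions are retrieved, and the filter keeps only the concepts whose similarity to the image clears the threshold. The surviving concepts would then be passed to a keywords-to-sentence LM, a step this sketch does not cover.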