Exploring the Distinctiveness and Fidelity of the Descriptions Generated by Large Vision-Language Models
arXiv (2024)
Abstract
Large Vision-Language Models (LVLMs) are gaining traction for their
remarkable ability to process and integrate visual and textual data. Despite
their popularity, the capacity of LVLMs to generate precise, fine-grained
textual descriptions has not been fully explored. This study addresses this gap
by focusing on distinctiveness and fidelity, assessing how
models like Open-Flamingo, IDEFICS, and MiniGPT-4 can distinguish between
similar objects and accurately describe visual features. We propose the
Textual Retrieval-Augmented Classification (TRAC) framework, which leverages
the models' generative capabilities to enable a deeper analysis of
fine-grained visual description generation. This research provides
valuable insights into the generation quality of LVLMs, enhancing the
understanding of multimodal language models. Notably, MiniGPT-4 stands out for
its better ability to generate fine-grained descriptions, outperforming the
other two models in this aspect. The code is provided at .
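The abstract does not spell out how TRAC works, but a retrieval-augmented classification setup of this kind is commonly built by comparing an LVLM-generated description against reference descriptions of candidate classes and picking the closest match. The sketch below is a minimal, hypothetical illustration of that idea using a plain bag-of-words cosine similarity; the function names, the similarity measure, and the toy class descriptions are all assumptions, not the paper's actual implementation.

```python
import math
from collections import Counter

def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two bag-of-words term-count vectors."""
    num = sum(a[t] * b[t] for t in set(a) & set(b))
    den = math.sqrt(sum(v * v for v in a.values())) * \
          math.sqrt(sum(v * v for v in b.values()))
    return num / den if den else 0.0

def trac_classify(generated_desc: str, reference_descs: dict[str, str]) -> str:
    """Hypothetical TRAC-style step: classify an LVLM-generated description
    by retrieving the most similar reference class description."""
    query = Counter(generated_desc.lower().split())
    scores = {cls: cosine(query, Counter(desc.lower().split()))
              for cls, desc in reference_descs.items()}
    return max(scores, key=scores.get)

# Toy example (invented class descriptions, not from the paper):
refs = {
    "sparrow": "small brown bird with streaked plumage",
    "cardinal": "bright red bird with a crest",
}
print(trac_classify("a small brown bird with streaks", refs))  # -> sparrow
```

In practice one would replace the bag-of-words vectors with dense text embeddings, but the retrieval-then-classify structure stays the same: the more distinctive and faithful the generated description, the more reliably it retrieves the correct class.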