Effectiveness Assessment of Recent Large Vision-Language Models
arXiv (2024)
Abstract
The advent of large vision-language models (LVLMs) represents a noteworthy
advancement towards the pursuit of artificial general intelligence. However,
the extent of their efficacy across both specialized and general tasks warrants
further investigation. This article evaluates the competency of
popular LVLMs in specialized and general tasks, respectively, aiming to offer a
thorough understanding of these innovative methodologies. To gauge their
efficacy in specialized tasks, we tailor a comprehensive testbed comprising
three distinct scenarios: natural, healthcare, and industrial, encompassing six
challenging tasks. These tasks include salient, camouflaged, and transparent
object detection, as well as polyp and skin lesion detection, alongside
industrial anomaly detection. We examine the performance of three recent
open-source LVLMs – MiniGPT-v2, LLaVA-1.5, and Shikra – in the realm of
visual recognition and localization. Moreover, we conduct empirical
investigations utilizing the aforementioned models alongside GPT-4V, assessing
their multi-modal understanding capacities in general tasks such as object
counting, absurd question answering, affordance reasoning, attribute
recognition, and spatial relation reasoning. Our investigations reveal that
these models demonstrate limited proficiency not only in specialized tasks but
also in general tasks. We delve deeper into this inadequacy and suggest several
potential factors, including limited cognition in specialized tasks, object
hallucination, text-to-image interference, and decreased robustness in complex
problems. We hope this study will provide valuable insights for the future
development of LVLMs, augmenting their power in coping with both general and
specialized applications.