Do LLMs Understand Visual Anomalies? Uncovering LLM Capabilities in Zero-shot Anomaly Detection
arXiv (2024)
Abstract
Large vision-language models (LVLMs) are markedly proficient in deriving
visual representations guided by natural language. Recent explorations have
utilized LVLMs to tackle zero-shot visual anomaly detection (VAD) challenges by
pairing images with textual descriptions indicative of normal and abnormal
conditions, referred to as anomaly prompts. However, existing approaches depend
on static anomaly prompts that are prone to cross-semantic ambiguity, and
prioritize global image-level representations over crucial local pixel-level
image-to-text alignment that is necessary for accurate anomaly localization. In
this paper, we present ALFA, a training-free approach designed to address these
challenges via a unified model. We propose a run-time prompt adaptation
strategy that first leverages a large language model (LLM) to generate
informative anomaly prompts. This strategy is enhanced by a
contextual scoring mechanism for per-image anomaly prompt adaptation and
cross-semantic ambiguity mitigation. We further introduce a novel fine-grained
aligner to fuse local pixel-level semantics for precise anomaly localization,
by projecting the image-text alignment from global to local semantic spaces.
Extensive evaluations on the challenging MVTec and VisA datasets confirm ALFA's
effectiveness in harnessing the language potential for zero-shot VAD, achieving
a significant PRO improvement of 12.1% over state-of-the-art zero-shot VAD approaches.
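To make the two ideas in the abstract concrete, below is a minimal sketch of (1) per-image contextual scoring of LLM-generated anomaly prompts and (2) projecting image-text alignment from the global image level down to local patch level to obtain a pixel-wise anomaly map. This is an illustrative sketch under assumed shapes and names (e.g., `score_prompts`, `patch_anomaly_map`), not ALFA's actual implementation or API.

```python
# A minimal, self-contained sketch (NumPy only) of the two ideas described in the
# abstract: (1) contextual scoring of candidate anomaly prompts per image, and
# (2) fine-grained patch-level image-to-text alignment for anomaly localization.
# All shapes, names, and hyperparameters are illustrative assumptions.
import numpy as np

def l2_normalize(x, axis=-1, eps=1e-8):
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)

def score_prompts(global_img_emb, prompt_embs, top_k=4):
    """Contextual scoring: rank candidate (LLM-generated) prompt embeddings by
    similarity to the current image and keep the top-k for this image."""
    sims = l2_normalize(prompt_embs) @ l2_normalize(global_img_emb)
    keep = np.argsort(sims)[::-1][:top_k]
    return keep, sims[keep]

def patch_anomaly_map(patch_embs, normal_embs, abnormal_embs, grid_hw, temperature=0.07):
    """Fine-grained alignment: compare each patch embedding against pooled
    'normal' and 'abnormal' text embeddings, then softmax over the two classes."""
    normal_proto = l2_normalize(normal_embs.mean(axis=0))
    abnormal_proto = l2_normalize(abnormal_embs.mean(axis=0))
    patches = l2_normalize(patch_embs)                       # (N_patches, D)
    logits = np.stack([patches @ normal_proto,
                       patches @ abnormal_proto], axis=-1) / temperature
    probs = np.exp(logits - logits.max(axis=-1, keepdims=True))
    probs /= probs.sum(axis=-1, keepdims=True)
    return probs[:, 1].reshape(grid_hw)                      # P(abnormal) per patch

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    d, n_prompts, grid = 512, 10, (14, 14)
    img_emb = rng.normal(size=d)                             # global image embedding
    prompt_embs = rng.normal(size=(n_prompts, d))            # candidate prompt embeddings
    keep, scores = score_prompts(img_emb, prompt_embs)
    patch_embs = rng.normal(size=(grid[0] * grid[1], d))     # local patch embeddings
    amap = patch_anomaly_map(patch_embs,
                             normal_embs=prompt_embs[:5],
                             abnormal_embs=prompt_embs[5:],
                             grid_hw=grid)
    print("kept prompts:", keep, "anomaly map shape:", amap.shape)
```

In a real CLIP-style pipeline the random arrays above would be replaced by the encoder's global image token, its patch tokens, and the text embeddings of the selected normal/abnormal prompts; the resulting patch-level map would then be upsampled to image resolution for pixel-wise localization.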