HallE-Control: Controlling Object Hallucination in Large Multimodal Models
arXiv (2023)
Abstract
Current Large Multimodal Models (LMMs) achieve remarkable progress, yet there
remains significant uncertainty regarding their ability to accurately apprehend
visual details, that is, in performing detailed captioning. To address this, we
introduce CCEval, a GPT-4 assisted evaluation method for detailed
captioning. Interestingly, while LMMs demonstrate minimal object existence
hallucination in existing VQA benchmarks, our proposed evaluation reveals
continued susceptibility to such hallucinations. In this paper, we make the
first attempt to investigate such hallucination from different aspects,
including image resolution, the language decoder size, and instruction data
amount, quality, and granularity. Our findings underscore the unwarranted inference
when the language description includes details at a finer object granularity
than what the vision module can ground or verify, thus inducing hallucination.
To control such hallucinations, we further attribute the reliability of
captioning to contextual knowledge (involving only contextually grounded
objects) and parametric knowledge (containing inferred objects by the model).
Thus, we introduce HallE-Control, a controllable LMM in terms of
Hallucination in object Existence. HallE-Control can
condition the captioning to shift between (i) exclusively depicting contextual
knowledge for grounded objects and (ii) blending it with parametric knowledge
to imagine inferred objects. Our method reduces hallucination by 44% compared
to LLaVA_7B and maintains the object coverage.