VISREAS: Complex Visual Reasoning with Unanswerable Questions
arXiv (2024)
Abstract
Verifying a question's validity before answering is crucial in real-world
applications, where users may provide imperfect instructions. In this scenario,
an ideal model should address the discrepancies in the query and convey them to
the users rather than generating the best possible answer. Addressing this
requirement, we introduce a new compositional visual question-answering
dataset, VISREAS, that consists of answerable and unanswerable visual queries
formulated by traversing and perturbing commonalities and differences among
objects, attributes, and relations. VISREAS contains 2.07M semantically diverse
queries generated automatically using Visual Genome scene graphs. The unique
feature of this task, validating question answerability with respect to an
image before answering, and the poor performance of state-of-the-art models
inspired the design of a new modular baseline, LOGIC2VISION, which reasons by
producing and executing pseudocode, without any external modules, to generate the
answer. LOGIC2VISION outperforms generative models on VISREAS (+4.82% over
LLaVA-1.5; +12.23% over …) and achieves comparable performance against the
classification models.
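The core requirement the abstract describes — validating a question's answerability against the image's scene graph before answering — can be illustrated with a toy sketch. This is a hypothetical, minimal illustration of the idea, not the paper's actual pipeline or the VISREAS generation code; the scene-graph schema and function names are invented for this example. Perturbing an attribute in an otherwise valid query (here, "red" → "blue") makes the referent non-existent, so an ideal model should report the query as unanswerable rather than guess.

```python
# Toy scene graph in the spirit of Visual Genome: object id -> name,
# attribute set, and (relation, target) pairs. Schema is illustrative only.
scene_graph = {
    "o1": {"name": "cup", "attrs": {"red"}, "rels": {("on", "o2")}},
    "o2": {"name": "table", "attrs": {"wooden"}, "rels": set()},
}

def find_objects(graph, name, attr=None):
    """Return ids of objects matching a name and an optional attribute."""
    return [
        oid for oid, obj in graph.items()
        if obj["name"] == name and (attr is None or attr in obj["attrs"])
    ]

def answer_query(graph, subject, attr=None):
    """Answer 'what is the <attr> <subject> on?' only if the referent exists
    in the scene graph; otherwise flag the query as unanswerable."""
    matches = find_objects(graph, subject, attr)
    if not matches:
        return "unanswerable"  # validity check failed: no such object
    obj = graph[matches[0]]
    for rel, target in obj["rels"]:
        if rel == "on":
            return graph[target]["name"]
    return "unanswerable"  # object exists but the relation does not

print(answer_query(scene_graph, "cup", "red"))   # valid query -> "table"
print(answer_query(scene_graph, "cup", "blue"))  # perturbed attribute -> "unanswerable"
```

The perturbation step mirrors how the dataset's unanswerable queries are described as being formed: a small edit to an object, attribute, or relation breaks the query's grounding in the image, and the answerability check must catch it before any answer is produced.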