BloomVQA: Assessing Hierarchical Multi-modal Comprehension
CoRR (2023)
Abstract
We propose a novel VQA dataset, based on picture stories designed for
educating young children, that aims to facilitate comprehensive evaluation and
characterization of vision-language models on comprehension tasks. Unlike
current VQA datasets that often focus on fact-based memorization and simple
reasoning tasks without principled scientific grounding, we collect data
containing tasks reflecting different levels of comprehension and underlying
cognitive processes, as laid out in Bloom's Taxonomy, a classic framework
widely adopted in education research. The proposed BloomVQA dataset can be
mapped to a hierarchical graph-based representation of visual stories, enabling
automatic data augmentation and novel measures characterizing model consistency
across the underlying taxonomy. We demonstrate graded evaluation and
reliability analysis based on our proposed consistency metrics on
state-of-the-art vision-language models. Our results suggest that, while
current models achieve the most gain on low-level comprehension tasks, they
generally fall short on high-level tasks requiring more advanced comprehension
and cognitive skills: a 38.0% drop in VQA accuracy is observed between the
lowest- and highest-level tasks. Furthermore, current models show consistency
patterns misaligned with human comprehension in various scenarios, suggesting
emergent structures of model behaviors.