ViStruct: Visual Structural Knowledge Extraction via Curriculum Guided Code-Vision Representation
CoRR (2023)
Abstract
State-of-the-art vision-language models (VLMs) still have limited performance
in structural knowledge extraction, such as relations between objects. In this
work, we present ViStruct, a training framework to learn VLMs for effective
visual structural knowledge extraction. Two novel designs are incorporated.
First, we propose to leverage the inherent structure of programming language to
depict visual structural information. This approach enables explicit and
consistent representation of visual structural information of multiple
granularities, such as concepts, relations, and events, in a well-organized
structured format. Second, we introduce curriculum-based learning for VLMs to
progressively comprehend visual structures, from fundamental visual concepts to
intricate event structures. Our intuition is that lower-level knowledge may
contribute to complex visual structure understanding. Furthermore, we compile
and release a collection of datasets tailored for visual structural knowledge
extraction. We adopt a weakly-supervised approach to directly generate visual
event structures from captions for ViStruct training, capitalizing on abundant
image-caption pairs from the web. In experiments, we evaluate ViStruct on
visual structure prediction tasks, demonstrating its effectiveness in improving
the understanding of visual structures. The code is publicly available at
https://github.com/Yangyi-Chen/vi-struct.
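The code-vision representation is easiest to picture with a small sketch. The snippet below is a hypothetical illustration of how visual concepts, relations, and events could be expressed as nested program structures, with the concept-to-relation-to-event ordering mirroring the curriculum described above; the class names, fields, and example scene are assumptions made for this illustration, not ViStruct's actual schema (see the linked repository for the real format).

```python
# Hypothetical sketch of a code-based representation of visual structure.
# Class names, fields, and the example scene are illustrative assumptions,
# not the schema used in the ViStruct paper or repository.

from dataclasses import dataclass, field
from typing import List


@dataclass
class Concept:
    # Lowest-granularity knowledge: a visual concept grounded in the image.
    name: str  # e.g. "person", "bicycle"


@dataclass
class Relation:
    # Mid-level knowledge: a relation between two concepts.
    predicate: str  # e.g. "riding"
    subject: Concept
    object: Concept


@dataclass
class Event:
    # Highest-granularity knowledge: an event with its participant structure.
    trigger: str  # e.g. "ride"
    arguments: List[Relation] = field(default_factory=list)


# Example: encoding "a person riding a bicycle" as structured code.
person = Concept("person")
bicycle = Concept("bicycle")
riding = Relation(predicate="riding", subject=person, object=bicycle)
ride_event = Event(trigger="ride", arguments=[riding])
print(ride_event)
```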