Multimodal ArXiv: A Dataset for Improving Scientific Comprehension of Large Vision-Language Models
arXiv (2024)
Abstract
Large vision-language models (LVLMs), exemplified by GPT-4V, excel across
diverse tasks involving concrete images from natural scenes. However, their
ability to interpret abstract figures, such as geometric shapes and scientific
plots, remains limited due to a scarcity of training datasets in scientific
domains. To fill this gap, we introduce Multimodal ArXiv, consisting of
ArXivCap and ArXivQA, for enhancing LVLMs' scientific comprehension. ArXivCap is
a figure-caption dataset comprising 6.4M images and 3.9M captions sourced from
572K ArXiv papers spanning various scientific domains. Drawing from ArXivCap,
we introduce ArXivQA, a question-answering dataset generated by prompting
GPT-4V based on scientific figures. ArXivQA greatly enhances LVLMs'
mathematical reasoning capabilities, achieving a 10.4% absolute accuracy gain
on a multimodal mathematical reasoning benchmark. Furthermore, employing
ArXivCap, we devise four vision-to-text tasks for benchmarking LVLMs.
Evaluation results with state-of-the-art LVLMs underscore their struggle with
the nuanced semantics of academic figures, with domain-specific training
yielding substantial performance gains. Our error analysis uncovers
misinterpretations of visual context, recognition errors, and the production of
overly simplified captions by current LVLMs, shedding light on future
improvements.