PDF-MVQA: A Dataset for Multimodal Information Retrieval in PDF-based Visual Question Answering
arXiv (2024)
Abstract
Document Question Answering (QA) presents a challenge in understanding visually rich documents (VRDs), particularly those dominated by lengthy textual content such as research journal articles. Existing studies primarily focus on real-world documents with sparse text, and challenges persist in comprehending the hierarchical semantic relations across multiple pages needed to locate multimodal components. To address this gap, we propose PDF-MVQA, a dataset tailored to research journal articles that encompasses multiple pages and multimodal information retrieval. Unlike traditional machine reading comprehension (MRC) tasks, our approach aims to retrieve entire paragraphs containing answers, or visually rich document entities such as tables and figures. Our contributions include the introduction of a comprehensive PDF Document VQA dataset, enabling the examination of semantically hierarchical layout structures in text-dominant documents. We also present new VRD-QA frameworks designed to grasp textual content and relations among document layouts simultaneously, extending page-level understanding to the entire multi-page document. Through this work, we aim to enhance the capabilities of existing vision-and-language models in handling the challenges posed by text-dominant documents in VRD-QA.