Continuous document layout analysis: Human-in-the-loop AI-based data curation, database, and evaluation in the domain of public affairs

Information Fusion (2024)

In the digital era, the number of digital documents generated each day has been increasing exponentially over the years, to the point where processing them manually is infeasible. As a result, there is growing interest across different sectors in developing tools that process digital documents automatically. Although useful, this task is challenging because of both the large variability and the multimodal nature inherent to the problem. A text-only approach often falls short in comprehending the information conveyed by diverse components of varying significance. In this regard, Document Layout Analysis (DLA), whose objective is to detect and classify the basic components of a document, has been an active research field for many years. It is a useful first step toward understanding how the different components of a document interact with each other. In this work, we use a semi-automatic procedure to annotate digital documents with different layout labels, including 4 basic layout blocks and 4 text categories. We apply this procedure to collect a novel database for DLA in the public affairs domain, the PALdb database, using a set of 24 data sources from the Spanish Administration. The database comprises 37.9K documents with more than 441K document pages, and more than 8M labels associated with 8 layout block units. The results of our experiments validate the proposed text labeling procedure, with an accuracy of up to 99%. We also present a novel application of Quickest Change Detection (QCD) techniques in the DLA domain, which we use to continuously detect changes in the layout of documents from multiple sources.
Document layout analysis, Document understanding, Legal domain, QCD-based detection, Natural language processing, Human-in-the-loop
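
The abstract does not say which QCD statistic the authors use. Purely as an illustration of the general idea, the sketch below applies a one-sided CUSUM test, a classical QCD method, to a stream of per-batch layout scores. The function name, the score definition (fraction of pages matching the expected layout), and the `drift` and `threshold` values are all assumptions made for this example and are not taken from the paper.

```python
def cusum_change_point(scores, baseline_mean, drift=0.05, threshold=0.5):
    """Return the first index at which a one-sided CUSUM statistic
    exceeds `threshold`, or None if no change is flagged.

    `scores` is a hypothetical stream of per-batch layout scores,
    e.g. the fraction of pages whose layout matches the expected one.
    `baseline_mean`, `drift`, and `threshold` are illustrative values.
    """
    stat = 0.0
    for i, x in enumerate(scores):
        # Accumulate evidence of a downward shift from the baseline;
        # `drift` absorbs small fluctuations, and the statistic is
        # clamped at zero so old "good" batches do not mask a change.
        stat = max(0.0, stat + (baseline_mean - x) - drift)
        if stat > threshold:
            return i
    return None


# Hypothetical stream: the source's layout changes at index 4.
stream = [0.98, 0.97, 0.99, 0.98, 0.60, 0.55, 0.58]
alarm_index = cusum_change_point(stream, baseline_mean=0.98)
```

With these illustrative values the alarm is raised one batch after the change, which shows the usual QCD trade-off: a higher `threshold` lowers the false-alarm rate at the cost of a longer detection delay.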