Reading order detection in visually-rich documents with multi-modal layout-aware relation prediction

Pattern Recognition (2024)

Abstract
Reading order detection aims to arrange text in a logical sequence, which is essential for understanding visual documents. Most current methods model the problem as a sequence generation task. They use insufficient modality information and ignore the varied reading habits associated with different document layouts, which leaves them lacking robustness in some complex scenarios. To address these challenges, we present a novel approach, Multi-Modal Layout-Aware Relation Prediction. It employs a straightforward yet highly effective task formulation that predicts the order relation between text instances. Our model leverages visual, semantic, and positional features, with the positional features generated adaptively by a layout-aware position embedding module. The features of the different modalities are then enhanced by a two-stage position-guided multi-modal fusion module. Additionally, we introduce two novel loss functions, Degree Loss and Cycle Loss, to impose network constraints at multiple levels. Experiments on three real-world datasets demonstrate that our proposed method achieves new state-of-the-art performance.
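The abstract does not say how the predicted pairwise relations are turned into a full reading sequence. As a minimal sketch, assume the model outputs an n x n score matrix in which entry (i, j) rates how likely text instance j is to directly follow instance i. A simple greedy decoder can then recover one complete order. The successor-score interpretation and the function name `decode_reading_order` are hypothetical illustrations, not the authors' method.

```python
import numpy as np


def decode_reading_order(succ_scores: np.ndarray) -> list[int]:
    """Greedily decode a reading order from pairwise successor scores.

    ASSUMPTION (hypothetical): succ_scores[i, j] is a score that text
    instance j immediately follows text instance i in reading order.
    """
    n = succ_scores.shape[0]
    if n == 0:
        return []
    scores = succ_scores.astype(float).copy()
    np.fill_diagonal(scores, -np.inf)  # an instance cannot follow itself

    # Start from the instance that looks least like anyone's successor,
    # i.e. the one whose strongest incoming score is smallest.
    incoming = scores.max(axis=0)
    current = int(np.argmin(incoming))
    order = [current]
    visited = {current}

    while len(order) < n:
        row = scores[current].copy()
        row[list(visited)] = -np.inf  # never revisit, so the result has no cycles
        current = int(np.argmax(row))
        order.append(current)
        visited.add(current)
    return order


# Three text instances whose intended order is 2 -> 0 -> 1.
scores = np.array([
    [0.0, 0.9, 0.1],
    [0.1, 0.0, 0.2],
    [0.8, 0.1, 0.0],
])
print(decode_reading_order(scores))  # [2, 0, 1]
```

Masking visited instances guarantees that the output is a valid permutation, with one predecessor and one successor per instance and no cycles, even when the raw scores are inconsistent. In this sketch those constraints are enforced only at decoding time. The abstract instead states that the proposed Degree Loss and Cycle Loss impose constraints on the network during training.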
Keywords
Document understanding, Reading order detection, Layout-aware positional embedding