DiT: Self-supervised Pre-training for Document Image Transformer

International Multimedia Conference (2022)

Cited by 97 | Views 123
Abstract
Image Transformer has recently achieved significant progress on natural image understanding, using either supervised (ViT, DeiT, etc.) or self-supervised (BEiT, MAE, etc.) pre-training techniques. In this paper, we propose DiT, a self-supervised pre-trained Document Image Transformer model that uses large-scale unlabeled text images for Document AI tasks. Self-supervision is essential here because no supervised counterparts exist, owing to the lack of human-labeled document images. We leverage DiT as the backbone network in a variety of vision-based Document AI tasks, including document image classification, document layout analysis, table detection, and text detection for OCR. Experimental results show that the self-supervised pre-trained DiT model achieves new state-of-the-art results on these downstream tasks, e.g., document image classification (91.11 → 92.69), document layout analysis (91.0 → 94.9), table detection (94.23 → 96.55), and text detection for OCR (93.07 → 94.29). The code and pre-trained models are publicly available at https://aka.ms/msdit.
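Since the pre-trained checkpoints are public, one quick way to try DiT on document image classification is through the Hugging Face transformers library. The following is a minimal sketch, not the authors' training pipeline; it assumes the microsoft/dit-base-finetuned-rvlcdip checkpoint (a DiT model fine-tuned on the RVL-CDIP document classification dataset) is available on the Hugging Face Hub, and "document.png" is a placeholder for your own scanned page.

    # Minimal sketch: classify a document image with a fine-tuned DiT checkpoint.
    # Assumes the microsoft/dit-base-finetuned-rvlcdip checkpoint on the
    # Hugging Face Hub; "document.png" is a placeholder path.
    import torch
    from PIL import Image
    from transformers import AutoImageProcessor, AutoModelForImageClassification

    name = "microsoft/dit-base-finetuned-rvlcdip"
    processor = AutoImageProcessor.from_pretrained(name)
    model = AutoModelForImageClassification.from_pretrained(name)

    image = Image.open("document.png").convert("RGB")
    inputs = processor(images=image, return_tensors="pt")

    with torch.no_grad():
        logits = model(**inputs).logits

    # RVL-CDIP defines 16 document classes (letter, form, invoice, ...).
    predicted = logits.argmax(-1).item()
    print(model.config.id2label[predicted])

For the other downstream tasks mentioned in the abstract (layout analysis, table detection, text detection), DiT serves as the vision backbone inside a detection framework rather than a standalone classifier, so the setup differs from this classification sketch.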
Keywords
document image transformer, self-supervised, pre-training