LayoutLM: Pre-training of Text and Layout for Document Image Understanding

KDD '20: Proceedings of the 26th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, Virtual Event, CA, USA, July 2020, pp. 1192–1200.

Keywords:
layout information, pre-training technique, Masked Visual-Language Model, information extraction, form understanding

Abstract:

Pre-training techniques have been verified successfully in a variety of NLP tasks in recent years. Despite the widespread use of pre-training models for NLP applications, they almost exclusively focus on text-level manipulation, while neglecting layout and style information that is vital for document image understanding. In this paper, we…
Introduction
  • Document AI, or Document Intelligence, is a relatively new research topic that refers to techniques for automatically reading, understanding, and analyzing business documents.
  • Business documents are files that provide details related to a company’s internal and external transactions, as shown in Figure 1.
  • They may be digital-born, existing as electronic files, or they may be scanned from documents written or printed on paper.
  • Understanding business documents is very challenging due to the diversity of layouts and formats, the poor quality of scanned document images, and the complexity of template structures.
Highlights
  • Document AI, or Document Intelligence, is a relatively new research topic that refers to techniques for automatically reading, understanding, and analyzing business documents
  • Task #1: Masked Visual-Language Model. Inspired by the masked language model, we propose the Masked Visual-Language Model (MVLM) to learn the language representation with the clues of 2-D position embeddings and text embeddings (a minimal sketch appears after this list)
  • We evaluate the LayoutLM model on three different document image understanding tasks: Form Understanding, Receipt Understanding, and Document Image Classification
  • Form Understanding: We evaluate the form understanding task on the FUNSD dataset
  • We present LayoutLM, a simple but effective pre-training technique with text and layout information in a single framework
  • We evaluate the LayoutLM model on three tasks: form understanding, receipt understanding and scanned document image classification
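The page gives no implementation details for the highlights above, so here is a minimal PyTorch sketch of the two core ideas: word embeddings are summed with 2-D position embeddings looked up from each token's bounding box, and MVLM masks only the text side so the model must recover a word from its context plus its position on the page. The 0–1000 coordinate range and the 15% masking rate follow the paper's BERT-style setup; all class and function names here are our own illustration, not the authors' code.

```python
import torch
import torch.nn as nn

class LayoutEmbeddings(nn.Module):
    """Sketch of LayoutLM-style input embeddings: word embeddings summed with
    2-D position embeddings looked up from each token's bounding box
    (x0, y0, x1, y1), with coordinates normalized to the range [0, 1000]."""

    def __init__(self, vocab_size=30522, hidden=768, max_coord=1001):
        super().__init__()
        self.word_emb = nn.Embedding(vocab_size, hidden)
        self.x_emb = nn.Embedding(max_coord, hidden)  # shared by x0 and x1
        self.y_emb = nn.Embedding(max_coord, hidden)  # shared by y0 and y1

    def forward(self, input_ids, bbox):
        # input_ids: (batch, seq); bbox: (batch, seq, 4), values in [0, 1000]
        return (self.word_emb(input_ids)
                + self.x_emb(bbox[..., 0]) + self.y_emb(bbox[..., 1])
                + self.x_emb(bbox[..., 2]) + self.y_emb(bbox[..., 3]))

def mvlm_mask(input_ids, mask_token_id, mask_prob=0.15):
    """Masked Visual-Language Model objective: randomly mask word tokens while
    leaving the bbox tensor untouched, so the model predicts each masked word
    from its textual context and its 2-D position on the page."""
    labels = input_ids.clone()
    masked = torch.rand(input_ids.shape) < mask_prob
    labels[~masked] = -100  # standard "ignore" index for the LM loss
    return input_ids.masked_fill(masked, mask_token_id), labels
```

During pre-training, `mvlm_mask` would be applied to each batch before the forward pass, with a linear head over the final hidden states predicting the masked vocabulary ids.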
Methods
  • The performance of pre-trained models is largely determined by the scale and quality of datasets.
  • The authors need a large-scale scanned document image dataset to pre-train the LayoutLM model.
  • The authors' model is pre-trained on the IIT-CDIP Test Collection 1.0, which contains more than 6 million documents, with more than 11 million scanned document images.
  • Although the metadata contains erroneous and inconsistent tags, the scanned document images in this large-scale dataset are perfectly suitable for pre-training the model
Results
  • Form Understanding: The authors evaluate the form understanding task on the FUNSD dataset; the experiment results are shown in Table 1 (a fine-tuning sketch follows this list).
  • The authors compare the LayoutLM model with two SOTA pre-trained NLP models: BERT and RoBERTa (Liu et al., 2019b).
  • Compared to BERT, RoBERTa performs much better on this dataset because it is trained on more data for more epochs.
  • With the BASE architecture, the LayoutLM model pre-trained on 11M scanned document images achieves an F1 of 0.7866, much higher than BERT and RoBERTa with a similar number of parameters.
  • The authors add the MDC loss in the pre-training step, and it does bring substantial improvements on the FUNSD dataset.
  • The LayoutLM model achieves its best performance of 0.7927 when using text, layout, and image information at the same time.
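To make the FUNSD setup concrete, here is a minimal fine-tuning sketch using the Hugging Face transformers port of LayoutLM (a later open-source implementation, not the authors' original training code). The checkpoint name, the BIO label set, and the toy example are assumptions for illustration; form understanding is treated as token-level sequence labeling over question/answer/header entities.

```python
import torch
from transformers import LayoutLMTokenizer, LayoutLMForTokenClassification

# Illustrative BIO label set for FUNSD-style semantic entity labeling.
LABELS = ["O", "B-QUESTION", "I-QUESTION", "B-ANSWER", "I-ANSWER",
          "B-HEADER", "I-HEADER"]

tokenizer = LayoutLMTokenizer.from_pretrained("microsoft/layoutlm-base-uncased")
model = LayoutLMForTokenClassification.from_pretrained(
    "microsoft/layoutlm-base-uncased", num_labels=len(LABELS))

# A toy example: OCR words with bounding boxes normalized to 0..1000.
words = ["Date:", "11/02/1999"]
boxes = [[57, 45, 130, 60], [140, 45, 260, 60]]

# Align word-level boxes to word pieces: each piece inherits its word's box.
tokens, token_boxes = ["[CLS]"], [[0, 0, 0, 0]]
for word, box in zip(words, boxes):
    pieces = tokenizer.tokenize(word)
    tokens += pieces
    token_boxes += [box] * len(pieces)
tokens += ["[SEP]"]
token_boxes += [[1000, 1000, 1000, 1000]]

input_ids = torch.tensor([tokenizer.convert_tokens_to_ids(tokens)])
bbox = torch.tensor([token_boxes])
labels = torch.zeros_like(input_ids)  # dummy all-"O" labels for the sketch

loss = model(input_ids=input_ids, bbox=bbox, labels=labels).loss
loss.backward()  # a real run would step an optimizer over the FUNSD train set
```

A real run would iterate this over the FUNSD training split and report entity-level precision, recall, and F1 as in Table 1.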
Conclusion
  • The authors present LayoutLM, a simple but effective pre-training technique with text and layout information in a single framework.
  • The model can be trained in a self-supervised way on large-scale unlabeled scanned document images.
  • In future work, the authors will train LayoutLM using the LARGE architecture with text and layout, and will also involve image embeddings in the pre-training step.
Tables
  • Table 1: Model accuracy (Precision, Recall, F1) on the FUNSD dataset
  • Table 2: LayoutLM-BASE (Text + Layout, MVLM) accuracy with different data and epochs on the FUNSD dataset
  • Table 3: Different initialization methods for BASE and LARGE (Text + Layout, MVLM)
  • Table 4: Model accuracy (Precision, Recall, F1) on the SROIE dataset
  • Table 5: Classification accuracy on the RVL-CDIP dataset. Both BERT and RoBERTa underperform the image-based approaches, illustrating that text information alone is not sufficient for this task and that layout and image features are still needed. LayoutLM addresses this issue: even without image features, it still outperforms the single-model image-based approaches. After integrating image embeddings, LayoutLM achieves an accuracy of 94.42%, significantly better than several SOTA baselines for document image classification (see the sketch below)
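For the classification task, the text-plus-layout setup (without the image embeddings mentioned above, which this sketch does not cover) can be illustrated with the sequence-classification head of the same Hugging Face port. The random tensors stand in for real OCR output and exist only to show the expected shapes; the checkpoint name is an assumption.

```python
import torch
from transformers import LayoutLMForSequenceClassification

# RVL-CDIP defines 16 document categories (letter, form, invoice, ...).
model = LayoutLMForSequenceClassification.from_pretrained(
    "microsoft/layoutlm-base-uncased", num_labels=16)

# In practice input_ids/bbox come from OCR, as in the FUNSD sketch above;
# random placeholders are used here only to show tensor shapes.
input_ids = torch.randint(0, 30522, (1, 128))   # 30522 = BERT vocab size
bbox = torch.randint(0, 1000, (1, 128, 4))      # coordinates in [0, 1000)
logits = model(input_ids=input_ids, bbox=bbox).logits  # shape (1, 16)
predicted_class = logits.argmax(dim=-1)
```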
Related work
  • Research on Document Analysis and Recognition (DAR) dates back to the early 1990s. Mainstream approaches can be divided into three categories: rule-based approaches, conventional machine learning approaches, and deep learning approaches.

    4.1 RULE-BASED APPROACHES

    Rule-based approaches (Lebourgeois et al., 1992; O’Gorman, 1993; Ha et al., 1995b; Simon et al., 1997) comprise two types of analysis methods: bottom-up and top-down. Bottom-up methods (Lebourgeois et al., 1992; Ha et al., 1995a; Simon et al., 1997) usually detect the connected components of black pixels as the basic computational units in document images; the document segmentation process then combines them into higher-level structures through different heuristics and labels them according to different structural features. The Docstrum algorithm (O’Gorman, 1993), among the earliest successful bottom-up algorithms based on connected component analysis, groups connected components on a polar structure to derive the final segmentation. Simon
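To ground the bottom-up strategy, here is a small, self-contained sketch in the same spirit: connected components of ink pixels serve as the basic units, and a toy vertical-overlap heuristic merges them into text-line boxes. This illustrates the general approach only; it is not the Docstrum algorithm, and the threshold values are arbitrary.

```python
import cv2

def bottom_up_lines(image_path, min_area=10):
    """Illustrative bottom-up layout analysis: treat connected components of
    ink pixels as basic units, then merge components whose vertical extents
    overlap into text-line boxes."""
    gray = cv2.imread(image_path, cv2.IMREAD_GRAYSCALE)
    # Otsu threshold with inversion so ink becomes foreground (255).
    _, binary = cv2.threshold(gray, 0, 255,
                              cv2.THRESH_BINARY_INV | cv2.THRESH_OTSU)
    n, _, stats, _ = cv2.connectedComponentsWithStats(binary)
    boxes = [tuple(int(v) for v in stats[i][:4])   # (x, y, w, h)
             for i in range(1, n)                  # label 0 is background
             if stats[i][cv2.CC_STAT_AREA] >= min_area]
    boxes.sort(key=lambda b: (b[1], b[0]))         # top-to-bottom, then left

    lines = []  # running line boxes as [x0, y0, x1, y1]
    for x, y, w, h in boxes:
        for line in lines:
            overlap = min(line[3], y + h) - max(line[1], y)
            if overlap > 0.5 * h:                  # same text line: merge
                line[0] = min(line[0], x)
                line[1] = min(line[1], y)
                line[2] = max(line[2], x + w)
                line[3] = max(line[3], y + h)
                break
        else:
            lines.append([x, y, x + w, y + h])
    return lines
```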
Reference
  • Muhammad Zeshan Afzal, Andreas Kolsch, Sheraz Ahmed, and Marcus Liwicki. Cutting the error by half: Investigation of very deep cnn and advanced training strategies for document image classification. 2017 14th IAPR International Conference on Document Analysis and Recognition (ICDAR), 01:883–888, 2017.
  • Arindam Das, Saikat Roy, and Ujjwal Bhattacharya. Document image classification with intradomain transfer learning and stacked generalization of deep convolutional neural networks. 2018 24th International Conference on Pattern Recognition (ICPR), pp. 3180–3185, 2018.
  • Tyler Dauphinee, Nikunj Patel, and Mohammad Mehdi Rashidi. Modular multimodal architecture for document classification. ArXiv, abs/1912.04376, 2019.
  • Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pp. 4171–4186, Minneapolis, Minnesota, June 2019. Association for Computational Linguistics. doi: 10.18653/v1/N19-1423. URL https://www.aclweb.org/anthology/N19-1423.
  • Jaekyu Ha, Robert M Haralick, and Ihsin T Phillips. Document page decomposition by the bounding-box project. In Proceedings of 3rd International Conference on Document Analysis and Recognition, volume 2, pp. 1119–1122. IEEE, 1995a.
  • Jaekyu Ha, Robert M Haralick, and Ihsin T Phillips. Recursive xy cut using bounding boxes of connected components. In Proceedings of 3rd International Conference on Document Analysis and Recognition, volume 2, pp. 952–955. IEEE, 1995b.
  • Leipeng Hao, Liangcai Gao, Xiaohan Yi, and Zhi Tang. A table detection method for pdf documents based on convolutional neural networks. 2016 12th IAPR Workshop on Document Analysis Systems (DAS), pp. 287–292, 2016.
  • Adam W. Harley, Alex Ufkes, and Konstantinos G. Derpanis. Evaluation of deep convolutional nets for document image classification and retrieval. 2015 13th International Conference on Document Analysis and Recognition (ICDAR), pp. 991–995, 2015.
  • Kaiming He, Georgia Gkioxari, Piotr Dollar, and Ross B. Girshick. Mask R-CNN. CoRR, abs/1703.06870, 2017. URL http://arxiv.org/abs/1703.06870.
  • Guillaume Jaume, Hazim Kemal Ekenel, and Jean-Philippe Thiran. Funsd: A dataset for form understanding in noisy scanned documents. 2019 International Conference on Document Analysis and Recognition Workshops (ICDARW), 2:1–6, 2019.
  • Anoop R Katti, Christian Reisswig, Cordula Guder, Sebastian Brarda, Steffen Bickel, Johannes Hohne, and Jean Baptiste Faddoul. Chargrid: Towards understanding 2D documents. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pp. 4459–4469, Brussels, Belgium, October–November 2018. Association for Computational Linguistics. doi: 10.18653/v1/D18-1476. URL https://www.aclweb.org/anthology/D18-1476.
  • Ranjay Krishna, Yuke Zhu, Oliver Groth, Justin Johnson, Kenji Hata, Joshua Kravitz, Stephanie Chen, Yannis Kalantidis, Li-Jia Li, David A Shamma, Michael Bernstein, and Li Fei-Fei. Visual genome: Connecting language and vision using crowdsourced dense image annotations. 2016. URL https://arxiv.org/abs/1602.07332.
  • Frank Lebourgeois, Z Bublinski, and H Emptoz. A fast and efficient method for extracting text paragraphs and graphics from unconstrained documents. In Proceedings., 11th IAPR International Conference on Pattern Recognition. Vol. II. Conference B: Pattern Recognition Methodology and Systems, pp. 272–276. IEEE, 1992.
  • D. Lewis, G. Agam, S. Argamon, O. Frieder, D. Grossman, and J. Heard. Building a test collection for complex document information processing. In Proceedings of the 29th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR ’06, pp. 665–666, New York, NY, USA, 2006. ACM. ISBN 1-59593-369-7. doi: 10.1145/1148170.1148307. URL http://doi.acm.org/10.1145/1148170.1148307.
  • Xiaojing Liu, Feiyu Gao, Qiong Zhang, and Huasha Zhao. Graph convolution for multimodal information extraction from visually rich documents. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 2 (Industry Papers), pp. 32–39, Minneapolis, Minnesota, June 2019a. Association for Computational Linguistics. doi: 10.18653/v1/N19-2005. URL https://www.aclweb.org/anthology/N19-2005.
  • Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke S. Zettlemoyer, and Veselin Stoyanov. Roberta: A robustly optimized bert pretraining approach. ArXiv, abs/1907.11692, 2019b.
  • S. Marinai, M. Gori, and G. Soda. Artificial neural networks for document analysis and recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence, 27(1):23–35, Jan 2005. ISSN 1939-3539. doi: 10.1109/TPAMI.2005.4.
  • L. O’Gorman. The document spectrum for page layout analysis. IEEE Transactions on Pattern Analysis and Machine Intelligence, 15(11):1162–1173, Nov 1993. ISSN 1939-3539. doi: 10.1109/34.244677.
  • Shaoqing Ren, Kaiming He, Ross B. Girshick, and Jian Sun. Faster r-cnn: Towards real-time object detection with region proposal networks. IEEE Transactions on Pattern Analysis and Machine Intelligence, 39:1137–1149, 2015.
  • Ritesh Sarkhel and Arnab Nandi. Deterministic routing between layout abstractions for multi-scale classification of visually rich documents. In Proceedings of the Twenty-Eighth International Joint Conference on Artificial Intelligence, IJCAI-19, pp. 3360–3366. International Joint Conferences on Artificial Intelligence Organization, 7 2019. doi: 10.24963/ijcai.2019/466. URL https://doi.org/10.24963/ijcai.2019/466.
  • Sebastian Schreiber, Stefan Agne, Ivo Wolf, Andreas Dengel, and Sheraz Ahmed. Deepdesrt: Deep learning for detection and structure recognition of tables in document images. 2017 14th IAPR International Conference on Document Analysis and Recognition (ICDAR), 01:1162–1167, 2017.
  • Michael Shilman, Percy Liang, and Paul Viola. Learning nongenerative grammatical models for document analysis. In Tenth IEEE International Conference on Computer Vision (ICCV’05) Volume 1, volume 2, pp. 962–969. IEEE, 2005.
  • Aniko Simon, J-C Pret, and A Peter Johnson. A fast algorithm for bottom-up document layout analysis. IEEE Transactions on Pattern Analysis and Machine Intelligence, 19(3):273–277, 1997.
  • Carlos Soto and Shinjae Yoo. Visual detection with context for document layout analysis. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pp. 3462–3468, Hong Kong, China, November 2019. Association for Computational Linguistics. doi: 10.18653/v1/D19-1348. URL https://www.aclweb.org/anthology/D19-1348.
  • Christian Szegedy, Sergey Ioffe, Vincent Vanhoucke, and Alex Alemi. Inception-v4, Inception-ResNet and the impact of residual connections on learning. In AAAI, 2016.
  • Matheus Palhares Viana and Dario Augusto Borges Oliveira. Fast cnn-based document layout analysis. 2017 IEEE International Conference on Computer Vision Workshops (ICCVW), pp. 1173–1180, 2017.
  • H. Wei, M. Baechler, F. Slimane, and R. Ingold. Evaluation of svm, mlp and gmm classifiers for layout analysis of historical documents. In 2013 12th International Conference on Document Analysis and Recognition, pp. 1220–1224, Aug 2013. doi: 10.1109/ICDAR.2013.247.
  • Xiaowei Yang, Ersin Yumer, Paul Asente, Mike Kraley, Daniel Kifer, and C. Lee Giles. Learning to extract semantic structure from documents using multimodal fully convolutional neural networks. 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 4342–4351, 2017.
  • Xu Zhong, Jianbin Tang, and Antonio Jimeno-Yepes. Publaynet: largest dataset ever for document layout analysis. ArXiv, abs/1908.07836, 2019.