Learning To Identify Hundreds Of Flex-Form Documents

Document Recognition and Retrieval VI (1999)

Abstract
Electronic document management systems (EDMS) are advancing information retrieval from hardcopy documents. One of the key technological stumbling blocks in building these applications is the ability to reliably and rapidly differentiate among the many document variants. This paper presents an inductive document classifier (IDC) and its application to document identification. The most important features of the presented system are its learning capability, its handling of large volumes of highly variant documents, and its high performance. IDC learns new document types (variants) from examples. To this end, it automatically extracts discriminatory features from images of various document types, generates generalized descriptions, and stores them in the knowledge base. An unknown document is classified by matching its description against all general rules in the knowledge base and selecting the best-matching document types as the final classifications. Both the learning and identification processes are fast and accurate. The speed comes from optimized image processing and feature construction procedures. Identification accuracy is very high even though the discriminatory features are generated solely from page layout information. IDC operates in two separate components of an EDMS: the Knowledge Base Maintainer (KBM) and the Production Identifier (PI). KBM builds the knowledge base and maintains its integrity. PI uses the learned knowledge during identification.
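
The learn-then-match workflow described above can be illustrated with a minimal Python sketch. It is not the IDC implementation: the abstract does not say how layout features are represented or what form the learned rules take. The sketch assumes, purely for illustration, that each document is a dictionary of numeric layout features (the names n_hlines and n_text_blocks are hypothetical), that a generalized description is a per-feature value range widened by each training example, and that the match score is the fraction of ranges an unknown document satisfies. The class names mirror the KBM and PI roles, but their methods are invented here.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple

# Assumption: a document description is a dict of numeric layout features.
Features = Dict[str, float]


@dataclass
class Rule:
    """Hypothetical generalized description: one [low, high] range per feature."""
    doc_type: str
    ranges: Dict[str, Tuple[float, float]] = field(default_factory=dict)

    def generalize(self, example: Features) -> None:
        # Widen each range just enough to cover the new example.
        for name, value in example.items():
            low, high = self.ranges.get(name, (value, value))
            self.ranges[name] = (min(low, value), max(high, value))

    def match_score(self, description: Features) -> float:
        # Fraction of this rule's ranges that the description falls inside.
        if not self.ranges:
            return 0.0
        hits = sum(
            1
            for name, (low, high) in self.ranges.items()
            if name in description and low <= description[name] <= high
        )
        return hits / len(self.ranges)


class KnowledgeBaseMaintainer:
    """Learning side: builds one rule per document type from examples."""

    def __init__(self) -> None:
        self.rules: Dict[str, Rule] = {}

    def learn(self, doc_type: str, example: Features) -> None:
        self.rules.setdefault(doc_type, Rule(doc_type)).generalize(example)


class ProductionIdentifier:
    """Identification side: matches a description against every rule."""

    def __init__(self, rules: Dict[str, Rule]) -> None:
        self.rules = rules

    def identify(self, description: Features, top_k: int = 1) -> List[Tuple[str, float]]:
        scored = [(rule.match_score(description), name) for name, rule in self.rules.items()]
        # Best score first; ties broken by type name for determinism.
        scored.sort(key=lambda pair: (-pair[0], pair[1]))
        return [(name, score) for score, name in scored[:top_k]]


kbm = KnowledgeBaseMaintainer()
kbm.learn("invoice_a", {"n_hlines": 10, "n_text_blocks": 28})
kbm.learn("invoice_a", {"n_hlines": 12, "n_text_blocks": 32})
kbm.learn("claim_b", {"n_hlines": 3, "n_text_blocks": 55})

pi = ProductionIdentifier(kbm.rules)
ranking = pi.identify({"n_hlines": 11, "n_text_blocks": 30}, top_k=2)
print(ranking)
```

Keeping learning and identification in separate classes that share only the rule store mirrors the KBM/PI split in the abstract; the range-based rules and the scoring function are stand-ins that a real system would replace with its own learned descriptions.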
Keywords
inductive learning from examples, document recognition, form and flex-form identification, rule learning