Data Extraction from Web Tables: The Devil is in the Details

Document Analysis and Recognition (2011)

Abstract
We present a method based on header paths for efficient and complete extraction of labeled data from tables meant for humans. Although many table configurations yield to the proposed syntactic analysis, some require access to semantic knowledge. Clicking on one or two critical cells per table, through a simple interface, is sufficient to resolve most of these problem tables. Header paths, a purely syntactic representation of visual tables, can be transformed ("factored") into existing representations of structured data such as category trees, relational tables, and RDF triples. From a random sample of 200 web tables from ten large statistical web sites, we generated 376 relational tables and 34,110 subject-predicate-object RDF triples.
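
To make the header-path idea concrete, here is a minimal Python sketch. It is not the authors' implementation. It assumes a small hypothetical table whose row and column header hierarchies are already known, pairs each data cell with its row header path and column header path, and emits simple subject-predicate-object triples. The table contents, the function names (`header_paths`, `to_triples`), and the predicate names (`rowPath`, `colPath`, `value`) are all assumptions made for illustration.

```python
# Hypothetical table: a two-level column header and a one-level row header.
# Each path runs from the top-level header down to the leaf header.
COL_PATHS = [("Population", "2000"), ("Population", "2010")]
ROW_PATHS = [("Region A",), ("Region B",)]
DATA = [[10, 20],
        [30, 40]]  # made-up values, one per (row, column) pair


def header_paths(row_paths, col_paths, data):
    """Pair every data cell with its (row path, column path, value)."""
    cells = []
    for r, row_path in enumerate(row_paths):
        for c, col_path in enumerate(col_paths):
            cells.append((row_path, col_path, data[r][c]))
    return cells


def to_triples(cells):
    """Emit subject-predicate-object triples, three per data cell."""
    triples = []
    for i, (row_path, col_path, value) in enumerate(cells):
        subject = f"cell{i}"  # hypothetical identifier for the data cell
        triples.append((subject, "rowPath", "/".join(row_path)))
        triples.append((subject, "colPath", "/".join(col_path)))
        triples.append((subject, "value", value))
    return triples


cells = header_paths(ROW_PATHS, COL_PATHS, DATA)
triples = to_triples(cells)
for triple in triples:
    print(triple)
```

The same `cells` list could just as well be written out as rows of a relational table with one column per header level. The sketch uses triples only because they are the simplest of the target representations the abstract names.
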
Keywords
web tables, problem table, data extraction, syntactic representation, header path, RDF triple, relational table, large statistical web site, structured data, proposed syntactic analysis, table configuration, subject-predicate-object RDF triple, data mining, HTML, indexing, RDF, semantic web, Resource Description Framework