Profiling the semantics of n-ary web table data

Proceedings of the International Workshop on Semantic Big Data(2019)

引用 7|浏览80
暂无评分
摘要
The Web contains millions of relational HTML tables, which cover a multitude of different, often very specific topics. This rich pool of data has motivated a growing body of research on methods that use web table data to extend local tables with additional attributes or add missing facts to knowledge bases. Nearly all existing approaches for these tasks build upon the assumption that web table data consists of binary relations, meaning that an attribute value depends on a single key attribute, and that the key attribute value is contained in the HTML table. Inspecting randomly chosen tables on the Web, however, quickly reveals that both assumptions are wrong for a large fraction of the tables. In order to better understand the potential of non-binary web table data for downstream applications, this papers analyses a corpus of 5 million web tables originating from 80 thousand different web sites with respect to how many web table attributes are non-binary, what composite keys are required to correctly interpret the semantics of the non-binary attributes, and whether the values of these keys are found in the table itself or need to be extracted from the page surrounding the table. The profiling of the corpus shows that at least 38% of the relations are non-binary. Recognizing these relations requires information from the title or the URL of the web page in 50% of the cases. We find that different websites use keys of varying length for the same dependent attribute, e.g. one cluster of websites presents employment numbers depending on time, another cluster presents them depending on time and profession. By identifying these clusters, we lay the foundation for selecting Web data sources according to the specificity of the keys that are used to determine specific attributes.
更多
查看译文
关键词
data profiling, key detection, n-ary relations, web tables
AI 理解论文
溯源树
样例
生成溯源树,研究论文发展脉络
Chat Paper
正在生成论文摘要