Exploring and Analyzing Change: The Janus Project.

CIKM(2022)

引用 0|浏览23
暂无评分
摘要
Data change, all the time. The Janus project seeks to address the Variability dimension of Big Data by modeling, exploring, and analyzing such change, providing valuable insights into the evolving real world and ways in which data about it are collected and used. We start by identifying technical challenges that need to be addressed to realize the Janus vision. Towards this end, we have extracted and worked with the histories of various structured datasets, including DBLP, IMDB, open government data, and Wikipedia, for which a detailed history of every edit is available. Our DBChEx (Database Change Explorer) prototype enables interactive exploration of data and schema changes, and we show how DBChEx can help users gain valuable insights by exploring two real-world datasets, IMDB and Wikipedia infoboxes. Based on an analysis of the history of 3.5M tables on the English Wikipedia for a total of 53.8M table versions, we then illustrate the rich history of structured Wikipedia data: we show that tables are created in certain locations, they change their shape, they move, they grow, they shrink, their data change, they vanish, and they re-appear; indeed, each table has a life of its own. Finally, to help automatically interpret the useful knowledge harbored in the history of Wikipedia tables, we present recent results on two technical problems: (i) identifying Natural Keys, a particularly important piece of metadata, which serves as a primary key in tables over time and consists of attributes inherent to an entity, and (ii) matching tables, infoboxes and lists within a Wikipedia page across page revisions. We solve these problems at scale and make the resulting curated datasets available to the community to facilitate future research.
更多
查看译文
关键词
change
AI 理解论文
溯源树
样例
生成溯源树,研究论文发展脉络
Chat Paper
正在生成论文摘要