Identifying Insufficient Data Coverage in Databases with Multiple Relations

Proceedings of the VLDB Endowment (2020)

Abstract
In today's data-driven world, it is critical that we use appropriate datasets for analysis and decision-making. Datasets can be biased because they reflect existing inequalities in the world, because of the data scientists' biased world view, or because of the data scientists' limited control over the data collection process. For these reasons, it is important to ensure adequate data coverage across different groups over the intersection of multiple attributes. Often, the dataset to be analyzed is obtained through complex joins and predicate combinations over multiple relational tables in a database. Given the sheer volume of data involved, determining adequate coverage can require an unacceptably long execution time. In this paper, we provide an efficient approach for coverage analysis, given a set of attributes across multiple tables. To identify regions with insufficient coverage in the combinatorially large set of value combinations, we design an index scheme that avoids explicit table joins, achieves efficient memory usage, and supports predicate combination at a high level of parallelism. We also propose P-WALK, a priority-based search algorithm, to traverse the lattice space. Since, in practice, coverage assessment typically does not require precise COUNT aggregation results, we further present approximate methods to reduce computation time. Experimental evaluation using three real-world datasets shows the effectiveness, efficiency, and accuracy of the proposed methods.
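
To make the notion of insufficient coverage concrete, the following is a minimal brute-force sketch, not the index scheme or the P-WALK algorithm described in the abstract. It assumes the data has already been joined into a single list of rows and that a value combination counts as uncovered when fewer than `threshold` rows match it. The function name `uncovered_combinations`, the attribute names, and the threshold value are all hypothetical.

```python
from collections import Counter
from itertools import product


def uncovered_combinations(rows, attributes, threshold):
    """Return value combinations over `attributes` matched by fewer than `threshold` rows.

    Naive baseline (an assumption for illustration): enumerates every
    combination of observed attribute values, which is exactly the
    combinatorial blow-up the paper aims to avoid.
    """
    # Domain of each attribute: the distinct values observed in the data.
    domains = [sorted({row[a] for row in rows}) for a in attributes]
    # Exact COUNT per full value combination.
    counts = Counter(tuple(row[a] for a in attributes) for row in rows)
    # Combinations that never occur have an implicit count of zero.
    return [combo for combo in product(*domains) if counts[combo] < threshold]


# Hypothetical toy data: five rows over two attributes.
rows = [
    {"gender": "F", "race": "A"},
    {"gender": "F", "race": "A"},
    {"gender": "F", "race": "B"},
    {"gender": "M", "race": "A"},
    {"gender": "M", "race": "A"},
]
result = uncovered_combinations(rows, ["gender", "race"], threshold=2)
print(result)
```

The enumeration grows with the product of the attribute domain sizes and requires the joined data up front, which is the cost the abstract's join-free index and approximate counting are meant to reduce.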