Maintaining very large random samples using the geometric file

The VLDB Journal(2007)

引用 9|浏览1
暂无评分
摘要
Random sampling is one of the most fundamental data management tools available. However, most current research involving sampling considers the problem of how to use a sample, and not how to compute one. The implicit assumption is that a “sample” is a small data structure that is easily maintained as new data are encountered, even though simple statistical arguments demonstrate that very large samples of gigabytes or terabytes in size can be necessary to provide high accuracy. No existing work tackles the problem of maintaining very large, disk-based samples from a data management perspective, and no techniques now exist for maintaining very large samples in an online manner from streaming data. In this paper, we present online algorithms for maintaining on-disk samples that are gigabytes or terabytes in size. The algorithms are designed for streaming data, or for any environment where a large sample must be maintained online in a single pass through a data set. The algorithms meet the strict requirement that the sample always be a true, statistically random sample (without replacement) of all of the data processed thus far. We also present algorithms to retrieve small size random sample from large disk-based sample which may be used for various purposes including statistical analyses by a DBMS.
更多
查看译文
关键词
Data Stream,Naive Algorithm,Reservoir Sampling,Large Random Sample,Data Management Tool
AI 理解论文
溯源树
样例
生成溯源树,研究论文发展脉络
Chat Paper
正在生成论文摘要