Efficient Data Sampling in Heterogeneous Peer-to-Peer Networks

Omaha, NE (2007)

Abstract
Data-mining tasks such as clustering, classification, and prediction are arduous on large datasets and are often infeasible given current hardware limitations. The distributed nature of peer-to-peer databases complicates matters further by adding an access overhead cost on top of the cost of sending individual tuples over the network. We propose a two-level sampling approach for peer-to-peer databases that maximizes sample quality under a user-defined communication budget. Because individual peers may vary in cardinality, we propose an algorithm for determining the optimal sample rate (the percentage of tuples to sample from a peer) for each peer. We do this by analyzing the variance of individual peers, ultimately minimizing the total variance of the entire sample. By locally optimizing individual peer sample rates, we maximize the approximation accuracy of the samples. We also offer several techniques for sampling in peer-to-peer databases given various amounts of known and unknown information about the network and its peers.
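The abstract does not spell out the allocation algorithm, so the following is only an illustrative sketch of the general idea of choosing per-peer sample rates to reduce total variance under a communication budget. It uses classical Neyman allocation from stratified sampling, which is a stand-in and not necessarily the authors' method. The function name `neyman_sample_rates`, the cost parameters `access_cost` and `tuple_cost`, and the assumption that each peer's cardinality and standard deviation are known in advance are all hypothetical.

```python
def neyman_sample_rates(sizes, stddevs, budget, access_cost=0.0, tuple_cost=1.0):
    """Return one sample rate (fraction of tuples) per peer.

    Illustrative only: Neyman allocation, not the paper's algorithm.
    sizes    -- number of tuples held by each peer (assumed known, > 0)
    stddevs  -- per-peer standard deviation of the attribute (assumed known)
    budget   -- total communication budget, in the same units as the costs
    """
    if len(sizes) != len(stddevs):
        raise ValueError("sizes and stddevs must have the same length")

    # Pay the fixed access overhead once per peer, then spend the rest on tuples.
    tuples_affordable = (budget - access_cost * len(sizes)) / tuple_cost
    if tuples_affordable <= 0:
        return [0.0] * len(sizes)

    # Neyman allocation: sample size proportional to N_i * sigma_i.
    weights = [n * s for n, s in zip(sizes, stddevs)]
    total_weight = sum(weights)
    if total_weight == 0:
        # No variance information: fall back to allocation proportional to size.
        weights = list(sizes)
        total_weight = sum(weights)

    rates = []
    for n, w in zip(sizes, weights):
        allocated = tuples_affordable * w / total_weight
        # A peer cannot contribute more tuples than it holds; this simple
        # sketch caps the rate at 1.0 without redistributing the surplus.
        rates.append(min(allocated, n) / n)
    return rates


if __name__ == "__main__":
    # Two equally sized peers; the second has three times the spread.
    print(neyman_sample_rates([1000, 1000], [1.0, 3.0], budget=420, access_cost=10))
```

Under this stand-in scheme, peers with more tuples or higher variance receive a larger share of the budget, which matches the intuition in the abstract that per-peer rates should reflect per-peer variance.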
Keywords
arduous task, individual peer, heterogeneous peer-to-peer networks, sample rate, access overhead cost, peer-to-peer databases, data-mining task, individual tuples, optimal sample rate, sample quality, efficient data sampling, entire sample, database management systems, local optimization, overhead cost, data mining, approximation theory