I-Sampling: A New Block-Based Sampling Method for Large-Scale Dataset

2017 IEEE International Congress on Big Data (BigData Congress)(2017)

Abstract
We propose a block-based sampling method (I-sampling) that randomly selects base data blocks from the block pool of a large-scale dataset rather than choosing records directly from the original dataset. First, I-sampling partitions the given large-scale dataset into non-overlapping primary data blocks. Second, the records in each primary data block are randomly shuffled to form the corresponding shuffled data blocks. Third, I-sampling generates a block pool, which is a set of base data blocks. Theoretical analysis proves that each base data block has approximately the same probability distribution as the original dataset. Finally, the training dataset is produced by randomly selecting base data blocks from the block pool. Unlike traditional record-based sampling methods, I-sampling extends well to large-scale datasets on distributed systems. Simulated experiments demonstrate the feasibility of I-sampling and show that, for the given machine learning algorithm, the training datasets provided by I-sampling are equivalent to those produced by simple random sampling (a representative record-based sampling method).
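The four steps in the abstract can be illustrated with a minimal single-machine sketch. The abstract does not say how base data blocks are built from the shuffled blocks, so the construction here (base block j concatenates the j-th equal slice of every shuffled primary block) is an assumption made for illustration, as are the function name `i_sampling_sketch`, its parameters, and the use of contiguous partitioning. It should not be read as the authors' implementation.

```python
import random


def i_sampling_sketch(records, num_primary, num_base, num_selected, seed=0):
    """Illustrative block-based sampling; all names here are hypothetical."""
    rng = random.Random(seed)

    # Step 1: partition into non-overlapping primary blocks.
    # Assumption: contiguous, equal-sized blocks; leftover records are dropped.
    size = len(records) // num_primary
    primary = [records[i * size:(i + 1) * size] for i in range(num_primary)]

    # Step 2: shuffle the records inside each primary block independently.
    shuffled = []
    for block in primary:
        block = list(block)
        rng.shuffle(block)
        shuffled.append(block)

    # Step 3 (assumption, not stated in the abstract): base block j takes the
    # j-th slice of every shuffled block, so it draws from all primary blocks.
    slice_size = size // num_base
    pool = []
    for j in range(num_base):
        base = []
        for block in shuffled:
            base.extend(block[j * slice_size:(j + 1) * slice_size])
        pool.append(base)

    # Step 4: randomly select whole base blocks (without replacement) and
    # concatenate them to form the training dataset.
    chosen = rng.sample(pool, num_selected)
    return [record for base in chosen for record in base]


if __name__ == "__main__":
    data = list(range(1000))
    train = i_sampling_sketch(data, num_primary=10, num_base=5, num_selected=2)
    print(len(train))
```

Under this construction, the sampling step moves whole blocks rather than individual records, which is the property the abstract credits for suitability to distributed systems; in a distributed setting each primary block would presumably be shuffled on the node that stores it.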
Keywords
large-scale dataset,block-based sampling,record-based sampling,distributed system