DataNet: A Data Distribution-Aware Method for Sub-Dataset Analysis on Distributed File Systems

2016 IEEE International Parallel and Distributed Processing Symposium (IPDPS), 2016

Abstract
In this paper, we study the problem of sub-dataset analysis over distributed file systems, e.g., the Hadoop file system. Our experiments show that the distribution of sub-datasets over HDFS blocks can often cause the corresponding analysis to suffer from seriously imbalanced parallel execution. This is because the locality of individual sub-datasets is hidden by the Hadoop file system, and the content clustering of sub-datasets results in some computational nodes carrying out much more workload than others. We conduct a comprehensive analysis of how the imbalanced computing patterns occur and of their sensitivity to the size of a cluster. We then propose a novel method, referred to as DataNet, to optimize sub-dataset analysis over distributed storage systems. DataNet aims to achieve distribution-aware and workload-balanced computing and consists of the following three parts. First, we propose an efficient algorithm with linear complexity to obtain the meta-data of sub-dataset distributions. Second, we design an elastic storage structure called ElasticMap, based on the HashMap and BloomFilter techniques, to store the meta-data. Third, we employ a distribution-aware algorithm for sub-dataset applications to achieve workload balance in parallel execution. Our proposed method can benefit different sub-dataset analyses with various computational requirements. Experiments are conducted on PRObE's Marmot 128-node cluster testbed, and the results show the performance benefits of DataNet.
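The three parts named in the abstract can be illustrated with a minimal Python sketch. This is not the authors' implementation; the abstract gives no algorithmic detail, so everything below is an assumption made for illustration. In particular, the `BlockMeta` class, the `threshold` split between exact sizes (hash map) and membership only (Bloom filter), the `threshold // 2` size estimate for small sub-datasets, and the greedy `assign_blocks` scheduler are all hypothetical, and the scheduler ignores data locality for simplicity.

```python
import hashlib
from collections import defaultdict


class BloomFilter:
    """Tiny Bloom filter over string keys (illustrative, not tuned)."""

    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = 0  # a Python int used as a bit array

    def _positions(self, key):
        # Derive num_hashes bit positions from seeded SHA-256 digests.
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{key}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, key):
        for pos in self._positions(key):
            self.bits |= 1 << pos

    def __contains__(self, key):
        return all((self.bits >> pos) & 1 for pos in self._positions(key))


class BlockMeta:
    """Hypothetical per-block metadata: exact sizes for large sub-datasets
    in a dict, membership only for small ones in a Bloom filter."""

    def __init__(self, records, threshold):
        # records: iterable of (sub_dataset_id, size_in_bytes).
        # A single linear pass accumulates per-sub-dataset totals.
        totals = defaultdict(int)
        for sub_id, size in records:
            totals[sub_id] += size
        self.threshold = threshold
        self.exact = {}
        self.small = BloomFilter()
        for sub_id, total in totals.items():
            if total >= threshold:
                self.exact[sub_id] = total
            else:
                self.small.add(sub_id)

    def estimate(self, sub_id):
        if sub_id in self.exact:
            return self.exact[sub_id]
        # Assumed rough estimate for small sub-datasets; the Bloom filter
        # can report false positives, so this may overestimate.
        return self.threshold // 2 if sub_id in self.small else 0


def assign_blocks(block_metas, sub_id, num_workers):
    """Greedy assignment: heaviest block first, to the least-loaded worker."""
    loads = [0] * num_workers
    plan = [[] for _ in range(num_workers)]
    weighted = [(meta.estimate(sub_id), block_id)
                for block_id, meta in block_metas.items()]
    for weight, block_id in sorted(weighted, reverse=True):
        if weight == 0:
            continue  # block holds none of the requested sub-dataset
        target = loads.index(min(loads))
        loads[target] += weight
        plan[target].append(block_id)
    return plan, loads
```

With such metadata available, a scheduler could skip blocks that contain none of the requested sub-dataset and spread the remaining blocks by estimated sub-dataset size rather than by block count, which is the kind of imbalance the abstract describes.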
Keywords
distribution-aware algorithm, BloomFilter technique, HashMap, ElasticMap, elastic storage structure, workload-balanced computing, distribution-aware computing, content clustering, Hadoop file system, distributed file system, subdataset analysis, data distribution-aware method, DataNet