Taming Big Data SVM with Locality-Aware Scheduling

2016 Fourth International Conference on Advanced Cloud and Big Data (CBD 2016)

Abstract
Incorporating the MPI programming model into a data-intensive file system is significant for optimizing the performance of big data applications. In this paper we ported an MPI-SVM solver, originally developed for HPC environments, to the Hadoop Distributed File System (HDFS), and analyzed the performance bottlenecks the solver faces there. Storage expansion on HDFS is known to produce a skewed data distribution; as a result, we found that some hot nodes always receive concentrated I/O requests while other nodes always issue remote requests. These remote requests lengthen the I/O delays on the hot nodes, which becomes a performance bottleneck for our solver. We therefore improved the I/O-intensive data preprocessing stage with a deterministic, locality-aware scheduling method. Our improvement yields a balanced read pattern on every node: the time ratio between the longest and shortest processes is reduced by 60%, and the average read time is reduced by 78%. The amount of data served by each node also shows a small variance compared with the originally ported SVM algorithm. We believe our design avoids the overhead introduced by remote I/O operations, which will benefit many algorithms that cope with large-scale data.
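
The abstract does not give the scheduling algorithm in detail; the following is only a minimal sketch of what a deterministic, locality-aware assignment of HDFS blocks to MPI ranks could look like, using the standard libhdfs C API together with MPI. The NameNode address, input path, and the modulo tie-break rule are illustrative assumptions, not taken from the paper.

```c
/*
 * Minimal sketch (not the paper's implementation): each MPI rank claims
 * the HDFS blocks of the input file that have a replica on its own host,
 * so preprocessing reads stay local.  A deterministic modulo tie-break
 * ensures every block has a single owner.
 */
#include <mpi.h>
#include <hdfs.h>
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <stdlib.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank, nprocs;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    char host[MPI_MAX_PROCESSOR_NAME];
    int hostlen;
    MPI_Get_processor_name(host, &hostlen);

    /* Hypothetical NameNode ("default" = use configured default) and path. */
    hdfsFS fs = hdfsConnect("default", 0);
    const char *path = "/data/svm/training.dat";

    hdfsFileInfo *info = hdfsGetPathInfo(fs, path);
    tOffset blockSize = hdfsGetDefaultBlockSize(fs);
    /* One NULL-terminated hostname list per block of the file. */
    char ***hosts = hdfsGetHosts(fs, path, 0, info->mSize);

    hdfsFile f = hdfsOpenFile(fs, path, O_RDONLY, 0, 0, 0);
    char *buf = malloc((size_t)blockSize);

    for (int b = 0; hosts[b] != NULL; b++) {
        /* Does this rank's host hold a replica of block b?
         * (Simplified hostname comparison for illustration.) */
        int local = 0;
        for (int r = 0; hosts[b][r] != NULL; r++)
            if (strncmp(hosts[b][r], host, hostlen) == 0) local = 1;

        /* Deterministic tie-break: block b is owned by rank (b mod nprocs);
         * a rank reads only blocks it owns AND that are local to it, so hot
         * nodes are not flooded with remote reads.  Blocks with no local
         * owner would need a second assignment pass, omitted for brevity. */
        if (local && (b % nprocs) == rank) {
            tOffset off = (tOffset)b * blockSize;
            tSize len = (tSize)((off + blockSize <= info->mSize)
                                    ? blockSize : info->mSize - off);
            hdfsPread(fs, f, off, buf, len);
            /* ... feed the block into the SVM preprocessing stage ... */
        }
    }

    free(buf);
    hdfsCloseFile(fs, f);
    hdfsFreeHosts(hosts);
    hdfsFreeFileInfo(info, 1);
    hdfsDisconnect(fs);
    MPI_Finalize();
    return 0;
}
```

Compiled against libhdfs and an MPI implementation, such a scheme gives each process a fixed, reproducible set of local blocks, which matches the balanced per-node read pattern the abstract reports.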
Keywords
HDFS, parallel SVM, read performance, data locality, MPI