A Collective Communication Layer for the Software Stack of Big Data Analytics

2016 IEEE International Conference on Cloud Engineering Workshop (IC2EW)

Abstract
The landscape of distributed computing is evolving rapidly as many-core architectures give computers ever greater processing capability. Almost every field of science is now data driven and requires the analysis of massive datasets. Analytics algorithms such as machine learning can discover properties of a given dataset and make predictions based on it. However, simple and unified programming frameworks for these data-intensive applications are still lacking, and many existing efforts rely on specialized means to speed up individual algorithms. In this thesis research, a distributed programming model, MapCollective, is defined so that it can be applied easily to many machine learning algorithms. Specifically, algorithms that fit the iterative computation model can be parallelized easily with a unique collective communication layer for efficient synchronization. In contrast to traditional parallelization strategies that focus on handling high-volume input data, a lesser-known challenge is that the model data shared among parallel workers is equally high volume, spans multiple dimensions, and must be communicated continually throughout execution. This extends the understanding of data aspects in computation from in-memory caching of input data (e.g., the iterative MapReduce model) to fine-grained synchronization on model data (e.g., the MapCollective model). A library called Harp is developed as a Hadoop plugin to demonstrate that sophisticated machine learning algorithms can be abstracted simply with the MapCollective model and developed conveniently on top of the MapReduce framework. K-means and Multi-Dimensional Scaling (MDS) are tested over 4096 threads on the IU Big Red II Supercomputer; the results show linear speedup with an increasing number of parallel units.
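The abstract's central idea, synchronizing shared model data collectively each iteration rather than shuffling it through MapReduce, can be illustrated with a small sketch. The sketch below is not the Harp API; the class and helper names (KMeansAllreduceSketch, nearest) are hypothetical. It is a minimal, thread-based analogue of the MapCollective pattern for K-means, assuming each worker holds an in-memory partition of the input points and the centroids form the shared model that is merged in a collective step at the end of every iteration.

```java
import java.util.Arrays;
import java.util.concurrent.BrokenBarrierException;
import java.util.concurrent.CyclicBarrier;

// Hypothetical sketch of the MapCollective pattern (not the Harp API):
// each worker caches its input partition in memory, computes partial
// centroid sums per iteration, and the shared model (the centroids) is
// rebuilt in a collective merge step before the next iteration starts.
public class KMeansAllreduceSketch {
    static final int WORKERS = 4, K = 2, DIM = 2, ITERATIONS = 10;

    // Shared model data: partial sums and counts, merged collectively.
    static final double[][] sums = new double[K][DIM];
    static final int[] counts = new int[K];
    static final double[][] centroids = { { 0.0, 0.0 }, { 5.0, 5.0 } };

    public static void main(String[] args) throws Exception {
        // Synthetic input partitions, one per worker (illustration only).
        double[][][] partitions = new double[WORKERS][][];
        for (int w = 0; w < WORKERS; w++) {
            partitions[w] = new double[][] {
                { 0.0 + w * 0.1, 0.0 + w * 0.1 },   // points near (0, 0)
                { 5.0 + w * 0.1, 5.0 + w * 0.1 }    // points near (5, 5)
            };
        }

        // The barrier action plays the role of the collective step: when all
        // workers have contributed their partials, rebuild the model.
        CyclicBarrier merge = new CyclicBarrier(WORKERS, () -> {
            for (int c = 0; c < K; c++) {
                for (int d = 0; d < DIM; d++)
                    if (counts[c] > 0) centroids[c][d] = sums[c][d] / counts[c];
                Arrays.fill(sums[c], 0.0);
            }
            Arrays.fill(counts, 0);
        });

        Thread[] threads = new Thread[WORKERS];
        for (int w = 0; w < WORKERS; w++) {
            final double[][] part = partitions[w];
            threads[w] = new Thread(() -> {
                try {
                    for (int iter = 0; iter < ITERATIONS; iter++) {
                        // Map step: assign local points to nearest centroids.
                        for (double[] p : part) {
                            int best = nearest(p);
                            synchronized (counts) { // accumulate partial model data
                                counts[best]++;
                                for (int d = 0; d < DIM; d++) sums[best][d] += p[d];
                            }
                        }
                        merge.await(); // collective synchronization of model data
                    }
                } catch (InterruptedException | BrokenBarrierException e) {
                    Thread.currentThread().interrupt();
                }
            });
            threads[w].start();
        }
        for (Thread t : threads) t.join();
        System.out.println("Centroids: " + Arrays.deepToString(centroids));
    }

    // Index of the centroid closest to point p (squared Euclidean distance).
    static int nearest(double[] p) {
        int best = 0;
        double bestDist = Double.MAX_VALUE;
        for (int c = 0; c < K; c++) {
            double dist = 0;
            for (int d = 0; d < DIM; d++)
                dist += (p[d] - centroids[c][d]) * (p[d] - centroids[c][d]);
            if (dist < bestDist) { bestDist = dist; best = c; }
        }
        return best;
    }
}
```

In the actual Harp plugin, the merge would be performed by a collective communication operation over model partitions distributed across Hadoop workers; the in-process barrier action here merely stands in for that collective so the pattern stays self-contained and runnable.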
Keywords
iterative computation model, high volume input data handling, shared model data, parallel workers, in-memory caching, fine-grained synchronization, Harp library, Hadoop plugin, MapReduce framework, K-means algorithm, multidimensional scaling, MDS, IU Big Red II Supercomputer, machine learning algorithms, MapCollective model, distributed programming model, data intensive applications, unified programming frameworks, many-core architectures, distributed computing, Big Data analytics, software stack, collective communication layer