Designing Optimized MPI Broadcast and Allreduce for Many Integrated Core (MIC) InfiniBand Clusters

High-Performance Interconnects (2013)

Abstract
The emergence of co-processors such as the Intel Many Integrated Core (MIC) is changing the landscape of supercomputing. The MIC is a memory-constrained environment, and its processors operate at slower clock rates. Furthermore, the communication characteristics between MIC processes differ from those between host processes. Communication libraries that do not consider these architectural subtleties cannot deliver good communication performance. The performance of MPI collective operations strongly affects the performance of parallel applications. Owing to the challenges introduced by emerging heterogeneous systems, it is critical to fundamentally re-design collective algorithms so that applications can fully leverage the MIC architecture. In this paper, we propose a generic framework to optimize the performance of important collective operations, such as MPI_Bcast, MPI_Reduce, and MPI_Allreduce, on Intel MIC clusters. We also present a detailed analysis of the compute phases in reduce operations for MIC clusters. To the best of our knowledge, this is the first paper to propose novel designs to improve the performance of collectives on MIC clusters. Our designs improve the latency of the MPI_Bcast operation with 4,864 MPI processes by up to 76%. We also observe up to 52.4% improvement in the communication latency of the MPI_Allreduce operation with 2K MPI processes on heterogeneous MIC clusters. Our designs also improve the execution time of the WindJammer application by up to 16%.
Keywords
mpi collective operation,supercomputing,application program interfaces,mic architecture,intel mic infiniband clusters,intel mic cluster,memory constrained environment,mic cluster,parallel architectures,heterogeneous systems,optimized mpi allreduce,designing optimized mpi broadcast,mpi reduce,infiniband clusters,windjammer application,multiprocessing systems,clocks,optimized mpi broadcast,mpi allreduce,communication performance,heterogeneous mic clusters,mpi bcast,mpi_allreduce,performance evaluation,mic process,coprocessors,message passing,communication libraries,mpi collective operation performance,intel many integrated core infiniband clusters,mpi allreduce operation,clock rates,mpi bcast operation,mpi_bcast operation latency,communication characteristics,integrated core