Optimized Execution of Parallel Loops via User-Defined Scheduling Policies

ICPP '19: Proceedings of the 48th International Conference on Parallel Processing(2019)

引用 7|浏览33
暂无评分
摘要
On-node parallelism continues to increase in importance for high-performance computing and most newly deployed supercomputers have tens of processor cores per node. These higher levels of on-node parallelism exacerbate the impact of load imbalance and locality in parallel computations, and current programming systems notably lack features to enable efficient use of these large numbers of cores or require users to modify codes significantly. Our work is motivated by the need to address application-specific load balance and locality requirements with minimal changes to application codes. In this paper, we propose a new approach to extend the specification of parallel loops via user functions that specify iteration chunks. We also extend the runtime system to invoke these user functions when determining how to create chunks and schedule them on worker threads. Our runtime system starts with subspaces specified in the user functions, performs load balancing of chunks concurrently, and stores the balanced groups of chunks to reduce load imbalance in future invocations. Our approach can be used to improve load balance and locality in many dynamic iterative applications, including graph and sparse matrix applications. We demonstrate the benefits of this work using MiniMD, a miniapp derived from LAMMPS, and three kernels from the GAP Benchmark Suite: Breadth-First Search, Connected Components, and PageRank, each evaluated with six different graph data sets. Our approach achieves geometric mean speedups of 1.16× to 1.54× over four standard OpenMP schedules and 1.07× over the static_steal schedule from recent research.
更多
查看译文
关键词
Load Balancing, Loop Parallelism, OpenMP, User-Defined Scheduling
AI 理解论文
溯源树
样例
生成溯源树,研究论文发展脉络
Chat Paper
正在生成论文摘要