Hedgehog: Understandable Scheduler-Free Heterogeneous Asynchronous Multithreaded Data-Flow Graphs

2020 IEEE/ACM 3rd Annual Parallel Applications Workshop: Alternatives To MPI+X (PAW-ATM)

Abstract
Getting performance on high-end heterogeneous nodes is challenging. This is due to the large semantic gap between a computation's specification (possibly mathematical formulas or an abstract sequential algorithm) and its parallel implementation; this gap obscures the program's parallel structures and how it gains or loses performance. We present Hedgehog, a library aimed at coarse-grained parallelism. It explicitly embeds a dataflow graph in a program and uses this graph at runtime to drive the program's execution so it takes advantage of hardware parallelism (multicore CPUs and multiple accelerators). Hedgehog has asynchronicity built in. It statically binds individual threads to graph nodes, which fire as soon as any of their inputs become available. This allows Hedgehog to avoid a global scheduler and the performance losses associated with global synchronization and thread-pool management. Hedgehog provides a separation of concerns and distinguishes between compute and state-maintenance tasks. Its API reflects this separation and allows a developer to gain a better understanding of performance when executing the graph. Hedgehog is implemented as a C++17 header-only library. One feature of the framework is its low overhead; it transfers control of data between two nodes in ≈1 μs. This low overhead combines with Hedgehog's API to provide essentially cost-free profiling of the graph, enabling performance experimentation that enhances a developer's insight into the program. Hedgehog's asynchronous dataflow graph supports a data-streaming programming model both within and between graphs. We demonstrate the effectiveness of this approach by highlighting the performance of streaming implementations of two numerical linear algebra routines, which is comparable to that of existing libraries: matrix multiplication achieves >95% of the theoretical peak of 4 GPUs; LU decomposition with partial pivoting starts streaming partial final result blocks 40× earlier than waiting for the full result. The relative ease and understandability of obtaining performance with Hedgehog promises to enable non-specialists to target performance on high-end single nodes.
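As a rough illustration of the execution model the abstract describes, the following C++17 sketch binds one thread to each node of a tiny two-stage pipeline and streams data through it with no global scheduler. This is not Hedgehog's actual API; the names Channel and Task, and the whole structure, are invented for this sketch and only mimic the idea of statically thread-bound nodes that fire when an input arrives.

// Illustrative sketch only -- NOT the Hedgehog API. Each node owns a
// statically bound thread that fires as soon as an input arrives; there is
// no global scheduler or shared thread pool.
#include <condition_variable>
#include <functional>
#include <iostream>
#include <mutex>
#include <optional>
#include <queue>
#include <thread>

// Minimal thread-safe queue used as the edge between two nodes.
template <typename T>
class Channel {
  std::queue<T> q_;
  std::mutex m_;
  std::condition_variable cv_;
  bool closed_ = false;
 public:
  void push(T v) {
    { std::lock_guard<std::mutex> lk(m_); q_.push(std::move(v)); }
    cv_.notify_one();
  }
  void close() {
    { std::lock_guard<std::mutex> lk(m_); closed_ = true; }
    cv_.notify_all();
  }
  std::optional<T> pop() {  // blocks until data arrives or the channel closes
    std::unique_lock<std::mutex> lk(m_);
    cv_.wait(lk, [this] { return !q_.empty() || closed_; });
    if (q_.empty()) return std::nullopt;
    T v = std::move(q_.front());
    q_.pop();
    return v;
  }
};

// A compute node: one thread, bound at construction, that pushes each
// result downstream as soon as the corresponding input has been processed.
template <typename In, typename Out>
class Task {
  Channel<In>& in_;
  Channel<Out>& out_;
  std::function<Out(In)> fn_;
  std::thread worker_;
 public:
  Task(Channel<In>& in, Channel<Out>& out, std::function<Out(In)> fn)
      : in_(in), out_(out), fn_(std::move(fn)),
        worker_([this] {
          while (auto v = in_.pop()) out_.push(fn_(std::move(*v)));
          out_.close();  // propagate end-of-stream downstream
        }) {}
  ~Task() { worker_.join(); }
};

int main() {
  Channel<int> a, b, c;
  // Two-stage pipeline: square, then negate; the stages overlap in time.
  Task<int, int> square(a, b, [](int x) { return x * x; });
  Task<int, int> negate(b, c, [](int x) { return -x; });

  for (int i = 1; i <= 5; ++i) a.push(i);  // stream inputs into the graph
  a.close();                               // signal end of stream

  while (auto v = c.pop()) std::cout << *v << '\n';  // -1 -4 -9 -16 -25
  return 0;
}

Because each node here owns its thread and reacts directly to data arrival on its input edge, there is no central run queue to synchronize; this scheduler-free, streaming behavior is the property the abstract credits for Hedgehog's low node-to-node overhead and for producing partial results early.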
Keywords
high performance computing, data-flow, heterogeneous computing, dataflow, pipelining, HPC, parallelism, heterogeneity, metaprogramming, C++, GPU