Orchestrating Large-Scale SpGEMMs using Dynamic Block Distribution and Data Transfer Minimization on Heterogeneous Systems.

ICDE 2023

Abstract
Sparse general matrix-matrix multiplication (SpGEMM) is a major kernel in various emerging applications, such as database management systems, deep learning, graph analysis, and recommendation systems. Since SpGEMM requires extensive computation, many SpGEMM techniques have been implemented on graphics processing units (GPUs) to fully exploit their massive data parallelism. However, traditional SpGEMM techniques usually do not fully utilize the GPU because most non-zero elements of the target sparse matrices are concentrated in a few hub nodes, while non-hub nodes contain barely any non-zero elements. This data characteristic (a power-law distribution) significantly degrades performance through load imbalance across GPU cores and low utilization within each core. Recent implementations have attempted to solve this problem with smart pre-/post-processing, but the net performance hardly improves and sometimes even deteriorates owing to the large overheads, and non-hub nodes remain inherently ill-suited to GPU computing even after optimization. Furthermore, with the rapid growth in GPU computing power and input data size, performance is no longer dominated by kernel execution but by data transfers such as device-to-host (D2H) transfers and file I/O.

Therefore, this work proposes the Dynamic Block Distributor (DBD), a novel full-system-level SpGEMM orchestration framework for heterogeneous systems that improves overall performance by enabling efficient CPU-GPU collaboration and by minimizing the data-transfer overhead between all system elements. The framework first divides the target matrix into smaller blocks and then offloads the computation of each block to the appropriate computing unit, GPU or CPU, based on its workload type and the resource-utilization status at runtime. It also reduces data-transfer overhead with simple but effective techniques: Row Collecting, I/O Overlapping, and I/O Binding. Our experiments showed that the framework accelerated SpGEMM execution, including both kernel execution and D2H transfers, by 3.24x on average, and the overall execution by 2.07x on average, compared to the baseline cuSPARSE library.
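The paper's actual implementation is not reproduced here; the following Python sketch only illustrates the block-distribution idea the abstract describes: the input matrix is split into row blocks, each block is classified by how hub-dominated it is, and dense blocks are routed to the GPU path while sparse blocks stay on the CPU path. All names, thresholds (block_rows, hub_nnz_per_row), the routing heuristic, and the use of two threads to stand in for the GPU stream and the CPU pool are assumptions made for illustration.

import scipy.sparse as sp
from concurrent.futures import ThreadPoolExecutor

def spgemm_block(a_block, b):
    # One row block of the product C = A @ B; SciPy's CSR @ CSR is itself an SpGEMM.
    return a_block @ b

def dynamic_block_dispatch(a, b, block_rows=1024, hub_nnz_per_row=64):
    """Split A into row blocks and route each block by its density (illustrative only)."""
    a = a.tocsr()
    blocks, routes = [], []
    for start in range(0, a.shape[0], block_rows):
        blk = a[start:start + block_rows]
        nnz_per_row = blk.nnz / max(blk.shape[0], 1)
        # Dense, hub-dominated blocks would be offloaded to the GPU;
        # sparse, non-hub blocks stay on the CPU.
        routes.append("gpu" if nnz_per_row >= hub_nnz_per_row else "cpu")
        blocks.append(blk)

    # Two workers stand in for the GPU stream and the CPU thread pool; in the
    # real system the two kinds of blocks would execute concurrently.
    with ThreadPoolExecutor(max_workers=2) as pool:
        futures = [pool.submit(spgemm_block, blk, b) for blk in blocks]
        results = [f.result() for f in futures]
    return sp.vstack(results).tocsr(), routes

if __name__ == "__main__":
    # Synthetic power-law-like input: a hub-heavy row block stacked on sparse rows.
    hub = sp.random(1024, 8192, density=2e-2, format="csr", random_state=0)
    tail = sp.random(7168, 8192, density=5e-4, format="csr", random_state=1)
    a = sp.vstack([hub, tail]).tocsr()
    b = sp.random(8192, 8192, density=1e-3, format="csr", random_state=2)
    c, routes = dynamic_block_dispatch(a, b)
    err = abs(c - a @ b)
    assert err.nnz == 0 or err.max() < 1e-8  # block-wise result matches the direct product
    print(routes.count("gpu"), "blocks routed to the GPU path,",
          routes.count("cpu"), "to the CPU path")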
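A second sketch, again an assumption-laden illustration rather than the paper's I/O Overlapping technique, shows the prefetching pattern the abstract alludes to: while one block of A is being multiplied, the next block is read from disk on a background thread, hiding file-I/O latency behind computation. The on-disk .npz layout and helper names are hypothetical.

import scipy.sparse as sp
from concurrent.futures import ThreadPoolExecutor

def load_block(path):
    # Hypothetical on-disk layout: each row block of A saved with sp.save_npz.
    return sp.load_npz(path)

def spgemm_with_prefetch(block_paths, b):
    """Multiply block i while block i+1 is being read from disk."""
    results = []
    with ThreadPoolExecutor(max_workers=1) as io_pool:
        pending = io_pool.submit(load_block, block_paths[0])
        for i in range(len(block_paths)):
            a_block = pending.result()            # wait for the block just loaded
            if i + 1 < len(block_paths):
                pending = io_pool.submit(load_block, block_paths[i + 1])  # prefetch next
            results.append(a_block @ b)           # compute while the next read proceeds
    return sp.vstack(results).tocsr()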
Keywords
Sparse matrix multiplication, large-scale sparse matrix, GPU, heterogeneous