Technical Talk: A SYCL Extension for User-Driven Online Kernel Fusion

IWOCL '23: Proceedings of the 2023 International Workshop on OpenCL (2023)

Abstract
Heterogeneous programming models such as SYCL allow developers to integrate the variety of accelerators found in today's heterogeneous systems into an application with ease. However, while offloading specific tasks to specialized accelerators can deliver significant performance improvements for many applications, short-running device kernels remain a challenge for most heterogeneous programming models. Each invocation of a device kernel incurs overhead from the necessary data transfers, kernel launch, and synchronization between host and device. For a sequence of short-running kernels in particular, this can lead to an unfavourable ratio of overhead to actual computation, resulting in performance degradation.

One potential solution to this problem is to merge multiple small, memory-bound, short-running kernels into a single larger kernel. This makes better use of the device's resources and amortizes the kernel launch overhead. Yet manually creating fused kernels is an error-prone, challenging task for developers, and the resulting kernels are less reusable and maintainable.

The extension to the SYCL API presented in this talk aims to automate the creation of fused kernels. It provides a mechanism for users or software frameworks using SYCL to instruct the runtime to automatically fuse multiple device kernels at runtime, without the need to implement the fused kernel manually. Users or software frameworks can use their application and domain knowledge, as well as runtime context information, to determine when fusion of kernels is legal and profitable, while the actual process of creating a fused kernel is automated by the SYCL runtime.

Reducing the kernel launch overhead is, however, not the only way kernel fusion can improve application performance. The LLVM-based JIT compiler integrated into the SYCL runtime implementation for automatic creation of fused kernels can perform further optimizations. One such optimization is the internalization of dataflow: intermediate results that originally had to be communicated via global memory between the different kernels become internal dataflow of the fused kernel. Replacing slow global-memory accesses for this internalized dataflow with faster accesses to local memory or even registers can yield significant performance improvements for many applications.

The extension presented in this talk is currently an experimental vendor extension targeting SYCL 2020. The initial proof-of-concept implementation was based on Codeplay's ComputeCpp SYCL implementation and has also been contributed and open-sourced as part of the DPC++ SYCL implementation.

To demonstrate the performance improvements unlocked by the extension, two different types of workloads are evaluated on Intel CPUs and integrated Intel GPUs. For a set of sixteen typical operator sequences from neural networks with various input sizes, kernel fusion achieves speedups between 0.9x and 2.26x on GPU (geometric mean 1.35x) and between 1.02x and 3.2x on CPU (geometric mean 1.78x). For complete neural networks, this translates to speedups of 1.19x (ResNet-50) and 1.68x (VGG-16) on CPU, and 1.15x (ResNet-50) and 1.02x (VGG-16) on GPU. For the six benchmarks 3mm, bicg, correlation, covariance, fdtd2d, and gramschmidt from the SYCL-Bench benchmark suite with different input sizes, fusion achieves speedups between 0.98x and 4.91x on GPU (geometric mean 1.34x) and between 0.82x and 3.28x on CPU (geometric mean 1.06x).
In summary, this talk presents a SYCL extension automating the creation of fused kernels on user request and shows the potential performance benefits of such an extension on different workloads.
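To make the usage model concrete, the sketch below shows how an application might request fusion of two short-running, memory-bound kernels. It is a minimal sketch assuming the experimental sycl_ext_codeplay_kernel_fusion interface contributed to DPC++: the fusion_wrapper type, the enable_fusion queue property, and the promote_private internalization property are taken from that experimental extension and may differ from the exact API discussed in the talk; the buffer names and kernel-name classes are purely illustrative.

```cpp
// Minimal sketch, assuming the experimental sycl_ext_codeplay_kernel_fusion
// interface in DPC++; names follow that experimental extension and may change.
#include <sycl/sycl.hpp>
#include <vector>

namespace fusion_ext = sycl::ext::codeplay::experimental;

int main() {
  constexpr size_t N = 1024;

  // The queue has to opt in to kernel fusion at construction time.
  sycl::queue q{sycl::property_list{fusion_ext::property::queue::enable_fusion{}}};

  std::vector<float> a(N, 1.0f), b(N, 2.0f), c(N, 0.0f);

  {
    sycl::buffer<float> bufA{a}, bufB{b}, bufC{c};
    // Intermediate result of kernel 1; marking the buffer for private-memory
    // promotion asks the JIT compiler to internalize this dataflow instead of
    // routing it through global memory.
    sycl::buffer<float> bufTmp{
        sycl::range<1>{N},
        sycl::property_list{fusion_ext::property::promote_private{}}};

    fusion_ext::fusion_wrapper fw{q};
    fw.start_fusion();  // subsequent submissions are recorded, not launched

    // Kernel 1: tmp = a + b (short-running, memory-bound)
    q.submit([&](sycl::handler &cgh) {
      sycl::accessor accA{bufA, cgh, sycl::read_only};
      sycl::accessor accB{bufB, cgh, sycl::read_only};
      sycl::accessor accTmp{bufTmp, cgh, sycl::write_only, sycl::no_init};
      cgh.parallel_for<class vec_add>(sycl::range<1>{N}, [=](sycl::id<1> i) {
        accTmp[i] = accA[i] + accB[i];
      });
    });

    // Kernel 2: c = tmp * a (consumes the intermediate result)
    q.submit([&](sycl::handler &cgh) {
      sycl::accessor accA{bufA, cgh, sycl::read_only};
      sycl::accessor accTmp{bufTmp, cgh, sycl::read_only};
      sycl::accessor accC{bufC, cgh, sycl::write_only, sycl::no_init};
      cgh.parallel_for<class vec_mul>(sycl::range<1>{N}, [=](sycl::id<1> i) {
        accC[i] = accTmp[i] * accA[i];
      });
    });

    // JIT-compile the two recorded kernels into one fused kernel and launch it.
    fw.complete_fusion();
    q.wait();
  }
  return 0;
}
```

In this sketch, start_fusion() switches the queue into a recording mode, complete_fusion() hands the recorded kernels to the runtime's JIT compiler for fusion and launch, and a cancel_fusion() call would instead launch the recorded kernels individually.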
Keywords
online kernel fusion, SYCL extension, user-driven