An Efficient Approach to Resolving Stack Overflow of SYCL Kernel on Intel® CPUs

IWOCL '24: Proceedings of the 12th International Workshop on OpenCL and SYCL (2024)

Abstract

SYCL is a parallel programming language that enables heterogeneous computing on various devices. The SYCL CPU device [1] uses the CPU as a device to run SYCL kernels. While most SYCL concepts, such as the device memory model and sub-group and work-group construction, map directly onto GPU hardware, the CPU lacks native support for them. These concepts therefore have to be emulated on the CPU device so that SYCL programs fully utilize the hardware and remain performance portable. To facilitate task parallelism at the work-group level, the SYCL CPU device distributes the execution of SYCL work-groups across CPU threads, each of which has a restricted stack size. The SYCL device memory model consists of three distinct memory regions. Global memory is accessible to all work-items in all work-groups. Local memory is accessible to all work-items in a single work-group. Private memory is accessible to a single work-item. The CPU device has no dedicated hardware for local and private memory, so both are emulated by allocating a block of memory for each of them on the stack. Because a thread's stack size cannot be changed after the thread is created, a stack overflow can occur when a kernel uses a large amount of private or local memory. The application master thread has a default stack size of 8MB on Linux but only 1MB on Windows, so the error is much more likely on Windows. The stack size of the other worker threads is set to 8MB on a 64-bit system and 4MB on a 32-bit system. To address this issue, the SYCL CPU device previously adopted a context-swapping approach that expands the stack using a low-level API provided by the operating system: when a work-group requires more stack than its executing thread has, the SYCL CPU device runtime swaps the thread's context before execution. However, this method causes large-scale performance degradation on Windows, because the swapping involves frequent and inefficient memory movement. Some SYCL workloads on Windows even hang with this approach. To solve the performance issue, we propose a novel approach that replaces the allocation instructions for private and local memory with addresses in the heap. A block of memory is allocated on the heap before kernel execution. The heap block can grow if another kernel that needs more stack memory is executed later. The heap buffer pointer is passed to the kernel as an implicit argument, and a null pointer is passed if the kernel does not need the heap for its stack. During kernel compilation, we replace the original alloca instructions with specific instructions that access heap memory for the private and local buffers. Experimental results on 21 SYCL workloads show that the novel approach significantly outperforms the context-swapping approach. The geometric-mean speedup is 153.73x on Windows and 1.11x on Linux, and the workloads no longer hang on Windows. The novel approach has no evident performance penalty compared with the baseline that does not use the heap. It could be adopted by other SYCL or SPMD CPU devices, such as the SYCL Native CPU device [2], since they all face the same problem. This feature will be delivered in the Intel oneAPI 2024.2 toolkit.
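
The heap-based scheme described in the abstract can be illustrated with a minimal, single-threaded C++ sketch. It is not the Intel runtime's implementation: the names HeapStack, Kernel, launch, big_kernel and small_kernel are hypothetical, the stack budget is a caller-supplied assumption, and the hand-written pointer cast in big_kernel only stands in for what the compiler would do when it rewrites an alloca.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdlib>

// Hypothetical grow-only scratch block that replaces stack storage.
class HeapStack {
public:
    HeapStack() = default;
    HeapStack(const HeapStack&) = delete;
    HeapStack& operator=(const HeapStack&) = delete;
    ~HeapStack() { std::free(base_); }

    // Reallocate only when a kernel needs more than any earlier kernel did.
    // Old contents are discarded: scratch data does not outlive a kernel.
    void* reserve(std::size_t bytes) {
        if (bytes > capacity_) {
            std::free(base_);
            base_ = std::malloc(bytes);
            capacity_ = base_ ? bytes : 0;
        }
        return base_;
    }
    std::size_t capacity() const { return capacity_; }

private:
    void* base_ = nullptr;
    std::size_t capacity_ = 0;
};

// Hypothetical kernel descriptor: the scratch size the kernel would have
// taken from the stack, plus a body that receives the implicit heap pointer.
struct Kernel {
    std::size_t scratch_bytes;
    long (*body)(std::size_t work_item, void* heap);
};

constexpr std::size_t kBigElems = std::size_t{1} << 20;  // 1 Mi ints of private memory

// Stand-in for a compiled kernel whose large private array was moved to the heap.
long big_kernel(std::size_t work_item, void* heap) {
    assert(heap != nullptr);
    int* priv = static_cast<int*>(heap);  // was: int priv[kBigElems]; on the stack
    for (std::size_t j = 0; j < kBigElems; ++j) priv[j] = static_cast<int>(work_item);
    return static_cast<long>(priv[0]) + priv[kBigElems - 1];
}

// Stand-in for a kernel whose private memory is small enough to stay on the stack.
long small_kernel(std::size_t work_item, void* heap) {
    (void)heap;  // the launcher passes a null pointer for this kernel
    int priv[16];
    for (int& v : priv) v = static_cast<int>(work_item);
    return static_cast<long>(priv[0]) + priv[15];
}

// Launcher: pass the heap block only if the kernel exceeds the stack budget.
long launch(const Kernel& k, std::size_t work_items, HeapStack& heap,
            std::size_t stack_budget) {
    void* base = k.scratch_bytes > stack_budget ? heap.reserve(k.scratch_bytes)
                                                : nullptr;
    long sum = 0;
    for (std::size_t i = 0; i < work_items; ++i) sum += k.body(i, base);
    return sum;
}
```

Calling launch with a 1MB budget first on small_kernel and then on big_kernel leaves the heap block unallocated for the first call and grows it to kBigElems * sizeof(int) bytes for the second. A real runtime would additionally need one such block per worker thread and separate offsets for the private and local buffers.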
