Automatic generation of ARM NEON micro-kernels for matrix multiplication

The Journal of Supercomputing (2024)

Abstract
General matrix multiplication (gemm) is a fundamental kernel in scientific computing and in current frameworks for deep learning. Modern realisations of gemm are mostly written in C, on top of a small, highly tuned micro-kernel that is usually encoded in assembly. High performance realisations of gemm in linear algebra libraries generally include a single micro-kernel per architecture, usually implemented by an expert. In this paper, we explore two paths to automatically generate gemm micro-kernels: C++ templates with vector intrinsics, and high-level Python scripts that directly produce assembly code. Both solutions can integrate high performance software techniques, such as loop unrolling and software pipelining, accommodate any data type, and easily generate micro-kernels of any requested dimension. The performance of this solution is tested on three ARM-based cores and compared with state-of-the-art libraries for these processors: BLIS, OpenBLAS and ArmPL. The experimental results show that the auto-generation approach is highly competitive, mainly due to the possibility of adapting the micro-kernel to the problem dimensions.
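To make the second path concrete, the following is a minimal illustrative sketch (not the authors' actual generator) of a Python script that emits the floating-point body of an mr x nr FP32 micro-kernel as AArch64 NEON assembly. The register allocation, the pointer registers (x0 for the A micro-panel, x1 for the B micro-panel) and the addressing modes are all assumptions made for illustration; only the general idea, generating assembly whose shape follows the requested micro-kernel dimensions, comes from the abstract.

```python
def generate_microkernel(mr: int, nr: int) -> str:
    """Emit the FMA body of an mr x nr FP32 gemm micro-kernel as
    AArch64 NEON assembly (one rank-1 update of the C micro-tile).

    Assumes 128-bit NEON vectors (4 FP32 lanes). Register names,
    x0/x1 as panel pointers, and post-increment addressing are
    illustrative choices, not the paper's actual conventions.
    """
    vlen = 4                            # FP32 lanes per 128-bit NEON register
    assert mr % vlen == 0, "mr must be a multiple of the vector length"
    rows = mr // vlen                   # vector registers per column of A
    # v0..v{rows-1}: A column; v{rows}: broadcast B element;
    # v{rows+1}..: C accumulators. All must fit in the 32 NEON registers.
    assert rows * (nr + 1) + 1 <= 32, "micro-tile exceeds the NEON register file"
    lines = [f"// {mr}x{nr} FP32 micro-kernel: one k-iteration"]
    # Load one micro-panel column of A into v0..v{rows-1}.
    for i in range(rows):
        lines.append(f"ld1 {{v{i}.4s}}, [x0], #16")
    # For each column j of the micro-tile: broadcast B[j], then accumulate.
    for j in range(nr):
        lines.append(f"ld1r {{v{rows}.4s}}, [x1], #4   // broadcast B[{j}]")
        for i in range(rows):
            acc = rows + 1 + j * rows + i   # accumulator register for C
            lines.append(f"fmla v{acc}.4s, v{i}.4s, v{rows}.4s")
    return "\n".join(lines)

print(generate_microkernel(8, 4))
```

For mr = 8, nr = 4 this emits two A loads, four broadcasts and eight fmla instructions; a production generator along the lines the paper describes would additionally emit the k-loop control, software-pipelined loads, and the code to store the C micro-tile back to memory.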
Keywords
Matrix multiplication, ARM NEON, SIMD arithmetic units, High performance