Low precision matrix multiplication for efficient deep learning in NVIDIA Carmel processors

The Journal of Supercomputing (2021)

Abstract
We introduce a high-performance, multi-threaded realization of the gemm kernel for the ARMv8.2 architecture that operates with 16-bit (half-precision) floating point operands. Our code is especially designed for efficient machine learning inference (and, to a certain extent, also training) with deep neural networks. The results on the NVIDIA Carmel multicore processor, which implements the ARMv8.2 architecture, show considerable performance gains for the gemm kernel, close to the theoretical peak acceleration that could be expected when moving from 32-bit to 16-bit arithmetic/data. When the kernel is applied to the type of convolution operator arising in convolutional neural networks, the speed-ups are more modest, though still relevant.
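
As a hedged illustration (not taken from the paper), the C sketch below shows the core idea behind an FP16 gemm kernel on ARMv8.2: the architecture's half-precision NEON instructions process 8 operands per 128-bit vector instead of the 4 operands of single precision, which is the source of the theoretical 2x peak acceleration mentioned above. The function name gemm_fp16, the row-major layout, and the requirement that n be a multiple of 8 are assumptions of this sketch; a high-performance, multi-threaded realization such as the paper's additionally involves cache blocking, operand packing, and tuned micro-kernels, all omitted here. Compile with, e.g., gcc -O3 -march=armv8.2-a+fp16.

    /* Minimal, illustrative FP16 gemm sketch for ARMv8.2-A (FEAT_FP16).
     * NOT the paper's kernel: it only shows the 8-wide half-precision
     * fused multiply-add that underlies the expected 2x speed-up over
     * 4-wide single precision. */
    #include <arm_neon.h>
    #include <stddef.h>

    /* C (m x n) += A (m x k) * B (k x n); row-major.
     * Assumption of this sketch: n is a multiple of 8. */
    void gemm_fp16(size_t m, size_t n, size_t k,
                   const __fp16 *A, const __fp16 *B, __fp16 *C)
    {
        for (size_t i = 0; i < m; ++i) {
            for (size_t j = 0; j < n; j += 8) {
                float16x8_t c = vld1q_f16(&C[i * n + j]);   /* 8 elements of row i of C */
                for (size_t p = 0; p < k; ++p) {
                    float16x8_t a = vdupq_n_f16(A[i * k + p]);  /* broadcast A(i,p) */
                    float16x8_t b = vld1q_f16(&B[p * n + j]);   /* B(p, j..j+7) */
                    c = vfmaq_f16(c, a, b);                     /* c += a * b (FP16 FMA) */
                }
                vst1q_f16(&C[i * n + j], c);
            }
        }
    }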
Keywords
Deep learning, Matrix multiplication, High performance, NVIDIA Carmel system-on-chip (SoC)