High performance and energy efficient inference for deep learning on multicore ARM processors using general optimization techniques and BLIS

Adrián Castelló, Sergio Barrachina, Manuel F. Dolz, Enrique S. Quintana-Ortí, Pau San Juan, Andrés E. Tomás

Journal of Systems Architecture (2022)

Abstract

We evolve PyDTNN, a framework for distributed parallel training of Deep Neural Networks (DNNs), into an efficient inference tool for convolutional neural networks. Our optimization process on multicore ARM processors involves several high-level transformations of the original framework, such as the development and integration of Cython routines to exploit thread-level parallelism; the design and development of micro-kernels for the matrix multiplication, vectorized with ARM’s NEON intrinsics, that can accommodate layer fusion; and the appropriate selection of several cache configuration parameters tailored to the memory hierarchy of the target ARM processors.
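The combination of cache blocking and layer fusion mentioned above can be illustrated with a minimal sketch. It is not PyDTNN's or BLIS's actual code: the function names `blocked_gemm_bias_relu` and `micro_kernel` are hypothetical, the block-size names `mc`, `nc`, `kc`, `mr`, `nr` simply follow the usual BLIS convention, and plain scalar Python stands in for the NEON-vectorized micro-kernel. The sketch assumes the fused operation is a bias addition followed by ReLU, applied inside the micro-kernel once the last block along the inner dimension has been accumulated.

```python
def micro_kernel(A, B, C, bias, i0, i1, j0, j1, p0, p1, fuse):
    """Update the tile C[i0:i1][j0:j1] with A[i0:i1][p0:p1] @ B[p0:p1][j0:j1].

    When `fuse` is true (last block along k), bias addition and ReLU are
    applied before the result is stored, so no extra pass over C is needed.
    """
    for i in range(i0, i1):
        for j in range(j0, j1):
            acc = C[i][j]
            for p in range(p0, p1):
                acc += A[i][p] * B[p][j]
            if fuse:
                acc = max(acc + bias[j], 0.0)  # fused bias + ReLU (assumed layer)
            C[i][j] = acc


def blocked_gemm_bias_relu(A, B, bias, mc=4, nc=4, kc=4, mr=2, nr=2):
    """Compute relu(A @ B + bias) with a BLIS-like blocked loop nest.

    A is m x k, B is k x n, bias has length n (all plain Python lists).
    mc, nc, kc would be tuned to the cache hierarchy; mr, nr to the registers.
    """
    m, k, n = len(A), len(B), len(B[0])
    C = [[0.0] * n for _ in range(m)]
    for jc in range(0, n, nc):                # column panel of B
        for pc in range(0, k, kc):            # block along the inner dimension
            last_k = pc + kc >= k             # fuse only after the final update
            for ic in range(0, m, mc):        # row block of A
                for jr in range(jc, min(jc + nc, n), nr):
                    for ir in range(ic, min(ic + mc, m), mr):
                        micro_kernel(
                            A, B, C, bias,
                            ir, min(ir + mr, ic + mc, m),
                            jr, min(jr + nr, jc + nc, n),
                            pc, min(pc + kc, k),
                            last_k,
                        )
    return C
```

In a tuned implementation the three outer block sizes would be chosen so that the packed blocks of `A` and `B` fit in the intended cache levels of the target processor; here they are ordinary keyword arguments so that different values can be tried.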

Keywords

Convolutional neural network, Inference, Multicore low-power processors
