High performance and energy efficient inference for deep learning on multicore ARM processors using general optimization techniques and BLIS

Journal of Systems Architecture (2022)

Abstract
We evolve PyDTNN, a framework for distributed parallel training of Deep Neural Networks (DNNs), into an efficient inference tool for convolutional neural networks. Our optimization process on multicore ARM processors involves several high-level transformations of the original framework, such as the development and integration of Cython routines to exploit thread-level parallelism; the design and development of micro-kernels for the matrix multiplication, vectorized with ARM’s NEON intrinsics, that can accommodate layer fusion; and the appropriate selection of several cache configuration parameters tailored to the memory hierarchy of the target ARM processors.
Keywords
Convolutional neural network, Inference, Multicore low-power processors
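
The abstract names three ingredients: a matrix-multiplication micro-kernel, layer fusion, and cache configuration parameters. As a rough illustration of how these can fit together, the following is a minimal pure-Python sketch of a cache-blocked matrix product with a fused bias and ReLU epilogue. It is not the PyDTNN or BLIS code. The names `MC`, `NC`, `KC`, `MR`, `NR`, `micro_kernel`, and `blocked_gemm_bias_relu` are hypothetical, the block sizes are placeholders rather than tuned values, and the choice of bias plus ReLU as the fused operation is an assumption, since the abstract does not say which layers are fused. Plain Python loops stand in for the NEON-vectorized inner loops and Cython threading described in the paper.

```python
# Hypothetical blocking parameters (assumed, not taken from the paper).
# In a real implementation they would be chosen from the L1/L2/L3 cache sizes.
MC, NC, KC = 4, 4, 2   # cache-level block sizes
MR, NR = 2, 2          # register-level micro-tile sizes


def micro_kernel(A, B, C, i0, j0, p0, mr, nr, kc):
    """Accumulate an mr x nr tile of C from a kc-deep slice of A and B."""
    for p in range(p0, p0 + kc):
        for i in range(i0, i0 + mr):
            a = A[i][p]
            for j in range(j0, j0 + nr):
                C[i][j] += a * B[p][j]


def blocked_gemm_bias_relu(A, B, bias):
    """Compute relu(A @ B + bias) using blocked loops and a fused epilogue."""
    m, k, n = len(A), len(B), len(B[0])
    C = [[0.0] * n for _ in range(m)]
    for jc in range(0, n, NC):                 # column blocks of B and C
        nc = min(NC, n - jc)
        for pc in range(0, k, KC):             # blocks along the shared dimension
            kc = min(KC, k - pc)
            last_k = (pc + kc == k)            # true on the final k-block only
            for ic in range(0, m, MC):         # row blocks of A and C
                mc = min(MC, m - ic)
                for jr in range(jc, jc + nc, NR):
                    nr = min(NR, jc + nc - jr)
                    for ir in range(ic, ic + mc, MR):
                        mr = min(MR, ic + mc - ir)
                        micro_kernel(A, B, C, ir, jr, pc, mr, nr, kc)
                        if last_k:
                            # Fused epilogue (assumed example of layer fusion):
                            # apply bias and ReLU while the tile is still hot,
                            # instead of making a second pass over C.
                            for i in range(ir, ir + mr):
                                for j in range(jr, jr + nr):
                                    C[i][j] = max(C[i][j] + bias[j], 0.0)
    return C
```

Applying the epilogue only on the last block along the shared dimension ensures each output tile receives bias and ReLU exactly once, after all partial products have been accumulated.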