Pipelining of a Mobile SoC and an External NPU for Accelerating CNN Inference.

IEEE Embed. Syst. Lett. (2024)

Convolutional neural network (CNN) algorithms are increasingly deployed on edge devices as hardware and software capabilities grow together. Deploying CNNs on resource-constrained devices often requires optimization across CPUs and GPUs. While dedicated hardware such as the neural processing unit (NPU) has been successfully introduced, cooperative methods among CPU, GPU, and NPU remain immature. In this paper, we propose two approaches to optimize the integration of a mobile system-on-chip (SoC) with an external neural processing unit (eNPU), achieving harmonious pipelining that improves inference latency and throughput. The first approach is a BLAS library search scheme that allocates the optimal library per layer on the host side; the second optimizes performance by searching for model slice points. We use the CPU-based NNPACK and OpenBLAS libraries and the GPU-based CLBlast library as automatically allocated compute backends. The entire neural network is optimally split into two segments based on the characteristics of its layers and the hardware performance. We evaluated our algorithm on several mobile devices, including the Hikey-970, Hikey-960, and Firefly-RK3399. Experiments show that the proposed pipelined inference method reduces latency by 10% and increases throughput by more than 17% compared to parallel execution on an eNPU and SoC.
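The core idea of the second approach can be sketched in code: the model is cut at a slice point so that one device processes the first segment of frame N while the other processes the second segment of frame N-1. The sketch below is a minimal illustration of that producer/consumer pipelining pattern using Python threads and a hand-off queue; `stage1` and `stage2` are hypothetical stand-ins for the two model segments, not the paper's actual kernels, and the slice-point search itself is not shown.

```python
import queue
import threading

def stage1(x):
    # Hypothetical first model segment (e.g. early conv layers on the SoC).
    return x * 2

def stage2(x):
    # Hypothetical second model segment (e.g. remaining layers on the eNPU).
    return x + 1

def pipelined_inference(inputs):
    """Run stage1 and stage2 concurrently, overlapping successive frames."""
    q = queue.Queue(maxsize=2)  # bounded hand-off buffer between the two devices
    results = []

    def producer():
        for x in inputs:
            q.put(stage1(x))
        q.put(None)  # sentinel: no more frames

    def consumer():
        while True:
            y = q.get()
            if y is None:
                break
            results.append(stage2(y))

    t1 = threading.Thread(target=producer)
    t2 = threading.Thread(target=consumer)
    t1.start(); t2.start()
    t1.join(); t2.join()
    return results

print(pipelined_inference([1, 2, 3]))  # [3, 5, 7]
```

Because the queue preserves FIFO order and there is a single consumer, outputs arrive in frame order; the throughput gain comes from the two stages executing on different devices at the same time, which is the effect the paper's 17% improvement measures.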
Key words
Inference Pipelining, NPU Pipelining, Convolutional Neural Network, Model Slicing