Exploiting Sparsity to Accelerate Fully Connected Layers of CNN-Based Applications on Mobile SoCs.

Xinfeng Xie, Dayou Du, Qian Li, Yun Liang, Wai Teng Tang, Zhongliang Ong, Mian Lu, Huynh Phung Huynh, Rick Siow Mong Goh

ACM Trans. Embedded Comput. Syst. (2018)

Abstract

Convolutional neural networks (CNNs) are widely employed in many image recognition applications. With the proliferation of embedded and mobile devices, such applications are becoming commonplace on mobile devices. Network pruning is a commonly used strategy to reduce the memory and storage footprints of CNNs on mobile devices. In this article, we propose customized versions of the sparse matrix multiplication algorithm to speed up inference on mobile devices and make it more energy efficient. Specifically, we propose a Block Compressed Sparse Column algorithm and a bit-representation-based algorithm (BitsGEMM) that exploit sparsity to accelerate the fully connected layers of a network on the NVIDIA Jetson TK1 platform. We evaluate the proposed algorithms using real-world object classification and object detection applications. Experiments show that performance speedups can be achieved over the original baseline implementation using cuBLAS. On object detection CNNs, an average speedup of 1.82× is obtained over baseline cuBLAS in the fully connected layer of the VGG model, whereas on classification CNNs, an average speedup of 1.51× is achieved for the fully connected layer of the pruned-VGG model. Energy consumption reduction of 43–46% is also observed due to decreased computational and memory bandwidth demands.
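
To illustrate the general idea of exploiting sparsity in a pruned fully connected layer, here is a minimal sketch of a matrix-vector product over a weight matrix stored in plain Compressed Sparse Column (CSC) form. It is not the Block Compressed Sparse Column or BitsGEMM algorithm proposed in the article, whose details the abstract does not give. The function names `dense_to_csc` and `csc_matvec` and the toy weights are hypothetical and chosen only for illustration.

```python
def dense_to_csc(W):
    """Convert a dense matrix (list of rows) into CSC arrays.

    Returns (values, row_idx, col_ptr, n_rows), where column j's nonzeros
    are values[col_ptr[j]:col_ptr[j + 1]] at rows row_idx[...].
    """
    n_rows = len(W)
    n_cols = len(W[0]) if n_rows else 0
    values, row_idx, col_ptr = [], [], [0]
    for j in range(n_cols):
        for i in range(n_rows):
            if W[i][j] != 0:
                values.append(W[i][j])
                row_idx.append(i)
        col_ptr.append(len(values))
    return values, row_idx, col_ptr, n_rows


def csc_matvec(values, row_idx, col_ptr, n_rows, x):
    """Compute y = W @ x, touching only the stored nonzeros of W."""
    y = [0.0] * n_rows
    for j in range(len(col_ptr) - 1):
        xj = x[j]
        if xj == 0:
            continue  # a zero input activation contributes nothing
        for k in range(col_ptr[j], col_ptr[j + 1]):
            y[row_idx[k]] += values[k] * xj
    return y


# Hypothetical pruned 4x3 weight matrix: only 4 of 12 weights are nonzero.
W = [
    [0.0, 2.0, 0.0],
    [1.0, 0.0, 0.0],
    [0.0, 0.0, 3.0],
    [0.0, 4.0, 0.0],
]
x = [1.0, 2.0, 3.0]

values, row_idx, col_ptr, n_rows = dense_to_csc(W)
y = csc_matvec(values, row_idx, col_ptr, n_rows, x)
print(y)
```

In this sketch the inner loop performs one multiply-add per stored nonzero, so the work and the memory traffic scale with the number of weights that survive pruning rather than with the full matrix size.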

Keywords

Convolutional neural networks, matrix tiling, mobile GPU, sparse matrix compression, sparse matrix multiplication, sparse matrix-vector multiplication