Memory-Side Acceleration and Sparse Compression for Quantized Packed Convolutions

2022 IEEE 34th International Symposium on Computer Architecture and High Performance Computing (SBAC-PAD), 2022

Abstract
Neural network compression techniques such as parameter quantization and weight pruning have made deep neural network (DNN) inference more efficient on low-power platforms such as MCUs and edge devices, reducing memory and computation overhead with minimal impact on model accuracy. To avoid storing and computing zeros, these techniques require sparse data representations, which introduce execution overhead to locate the values needed by a computation. Sparse matrix formats such as Compressed Sparse Row (CSR) and more recent designs are computationally inefficient when applied to the convolution algorithm and are also inefficient for storing quantized values. In this paper, we present an intuitive extension of CSR called Partitioned Sparse Representation (PSR), together with a convolution algorithm that hides the indexing overhead via a simple memory-side RISC-like core. PSR divides the entire weight array of a convolution layer into partitions that permit smaller (e.g., 8-bit) indexes, reducing storage overhead. We also rely on a memory-side accelerator called HHT, a programmable near-memory RISC-like co-processor that enables efficient processing of sparse data (including PSR). We show that HHT together with PSR allows the CPU to fully exploit RISC-V packed instructions on sparse quantized data. HHT achieves up to a 10x speedup for sparse CONV over a baseline in which the CPU performs all computations on dense data, runs end-to-end image classification inference 2.7x faster than the baseline, and achieves 70% energy savings over sparse CONV with the CPU performing all computations.
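The abstract describes PSR only at a high level, so the layout below is an assumption: a minimal C sketch in which each partition spans 256 consecutive columns, letting a nonzero's column offset fit in a uint8_t rather than the 32-bit column index plain CSR would store. All names (psr_row_t, part_ptr, PART_WIDTH, psr_row_dot) are hypothetical and not taken from the paper.

```c
/*
 * Hypothetical sketch of a PSR-like layout. The exact format is not
 * given in the abstract; the 256-column partition width and all field
 * names below are assumptions made for illustration.
 */
#include <stdint.h>
#include <stdio.h>

#define PART_WIDTH 256  /* columns per partition: offsets fit in 8 bits */

typedef struct {
    uint32_t *part_ptr;  /* nonzero range per partition (CSR-style)      */
    uint8_t  *offsets;   /* 8-bit column offset within its partition     */
    int8_t   *values;    /* quantized (int8) nonzero weights             */
    uint32_t  nparts;    /* partitions per row = ceil(ncols/PART_WIDTH)  */
} psr_row_t;

/* Accumulate the dot product of one PSR-stored sparse row with a dense
 * int8 input vector x, widening to int32 as integer-quantized kernels
 * typically do. */
static int32_t psr_row_dot(const psr_row_t *r, const int8_t *x)
{
    int32_t acc = 0;
    for (uint32_t p = 0; p < r->nparts; p++) {
        const int8_t *xp = x + (size_t)p * PART_WIDTH; /* partition base */
        for (uint32_t k = r->part_ptr[p]; k < r->part_ptr[p + 1]; k++)
            acc += (int32_t)r->values[k] * xp[r->offsets[k]];
    }
    return acc;
}

int main(void)
{
    /* Toy example: a 1 x 512 weight row (2 partitions, 3 nonzeros). */
    uint32_t part_ptr[3] = { 0, 2, 3 };     /* partition nonzero ranges  */
    uint8_t  offsets[3]  = { 4, 200, 17 };  /* cols 4, 200, 256 + 17     */
    int8_t   values[3]   = { 3, -2, 5 };
    psr_row_t row = { part_ptr, offsets, values, 2 };

    int8_t x[512] = { 0 };
    x[4] = 10; x[200] = 1; x[273] = 2;      /* 273 = 256 + 17            */

    printf("dot = %d\n", psr_row_dot(&row, x)); /* 3*10 - 2*1 + 5*2 = 38 */
    return 0;
}
```

Relative to CSR with 32-bit column indexes, such a layout cuts index storage per nonzero from 4 bytes to 1, at the cost of one part_ptr entry per partition, which matches the storage trade-off the abstract attributes to PSR.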
Keywords
CNN, sparsity, compression, RISC-V, programmable, quantization