Efficient N:M Sparse DNN Training Using Algorithm, Architecture, and Dataflow Co-Design

IEEE TRANSACTIONS ON COMPUTER-AIDED DESIGN OF INTEGRATED CIRCUITS AND SYSTEMS (2024)

Abstract
Sparse training is a promising technique for reducing the computational cost of deep neural networks (DNNs) while retaining high accuracy. In particular, N:M fine-grained structured sparsity, where only $N$ out of every $M$ consecutive elements can be nonzero, has attracted attention due to its hardware-friendly pattern and its ability to achieve a high sparsity ratio. However, the potential to accelerate N:M sparse DNN training has not been fully exploited, and efficient hardware support for N:M sparse training is lacking. To tackle these challenges, this article presents a computation-efficient training scheme for N:M sparse DNNs using algorithm, architecture, and dataflow co-design. At the algorithm level, a bidirectional weight pruning method, dubbed BDWP, is proposed to leverage the N:M sparsity of weights during both the forward and backward passes of DNN training, which significantly reduces the computational cost while maintaining model accuracy. At the architecture level, a sparse accelerator for DNN training, named SAT, is developed to support both regular dense operations and computation-efficient N:M sparse operations. At the dataflow level, multiple optimization methods, including interleave mapping, pregeneration of N:M sparse weights, and offline scheduling, are proposed to boost the computational efficiency of SAT. Finally, the effectiveness of our training scheme is evaluated on a Xilinx VCU1525 FPGA card using various DNN models (ResNet9, ViT, VGG19, ResNet18, and ResNet50) and datasets (CIFAR-10, CIFAR-100, Tiny ImageNet, and ImageNet). Experimental results show that the SAT accelerator with the BDWP sparse training method at a 2:8 sparsity ratio achieves an average speedup of 1.75x over dense training, with a negligible accuracy loss of 0.56% on average. Furthermore, our proposed training scheme significantly improves training throughput by 2.97x-25.22x and energy efficiency by 1.36x-3.58x over prior FPGA-based accelerators.
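To make the N:M sparsity pattern concrete, the following is a minimal magnitude-based pruning sketch: in each consecutive group of $M$ weights, only the $N$ largest-magnitude entries are kept. This is a generic illustration of the N:M constraint, not the paper's BDWP method; the function name `nm_prune` and the magnitude criterion are assumptions for demonstration.

```python
import numpy as np

def nm_prune(weights, n=2, m=8):
    """Keep the n largest-magnitude entries in each consecutive group of m.

    Generic magnitude-based N:M pruning sketch (not the paper's BDWP);
    assumes the total number of elements is divisible by m.
    """
    w = np.asarray(weights, dtype=float)
    groups = w.reshape(-1, m)  # view weights as consecutive groups of m
    # indices of the (m - n) smallest-magnitude entries in each group
    drop = np.argsort(np.abs(groups), axis=1)[:, : m - n]
    mask = np.ones_like(groups, dtype=bool)
    np.put_along_axis(mask, drop, False, axis=1)  # zero out dropped entries
    return (groups * mask).reshape(w.shape)

# Example: two groups of m=8, keeping n=2 nonzeros per group (2:8 sparsity)
w = np.arange(1.0, 17.0).reshape(2, 8)
sparse = nm_prune(w, n=2, m=8)
```

Under a 2:8 ratio, as in the paper's evaluation, each group of eight weights retains exactly two nonzero values, which is what makes the pattern amenable to regular hardware dataflows.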
Key words
Algorithm-hardware co-design, deep neural networks (DNNs), DNN training, neural network compression, pruning, sparse training