Eureka: Efficient Tensor Cores for One-sided Unstructured Sparsity in DNN Inference

56TH IEEE/ACM INTERNATIONAL SYMPOSIUM ON MICROARCHITECTURE, MICRO 2023(2023)

引用 0|浏览1
暂无评分
摘要
Deep neural networks (DNNs), while enormously popular, continue to place ever higher compute demand for which GPUs provide specialized matrix multipliers called tensor cores. To reduce the compute demand via sparsity, Nvidia Ampere's tensor cores support 2:4 structured sparsity in the filters (i.e., two non-zeros out of four values) which provides uniform 50% sparsity without any load imbalance issues. Consequently, the sparse tensor cores maintain (input or output) operand stationarity, which is fundamental for avoiding high-overhead hardware, requiring only one extra 4-1 multiplexer per multiply-accumulate unit (MAC). However, 2:4 sparsity is limited to 2x improvements in performance and energy without loss of accuracy, whereas unstructured sparsity provides 5-6x opportunity albeit while causing load imbalance. Previous papers on unstructured sparsity incur high hardware overhead (e.g., buffering, crossbars, scatter-gather networks, and address calculators) mainly due to sacrificing operand stationarity in favor of load balance. To avoid adding high overheads to the highly-efficient tensor cores, we propose Eureka, an efficient tensor core for unstructured sparsity. Eureka addresses load imbalance via three contributions: (1) Our key insight is that a slight weakening of output stationarity achieves load balance most of the time while incurring only a modest hardware overhead. Accordingly, we propose single-step uni-directional displacement (SUDS), where a filter element's multiplication can either occur in its original position or be displaced to a vacant MAC in the adjacent row below while the accumulation occurs in the original row to restore output stationarity. SUDS is an offline technique for inference. (2) We provide an optimal algorithm for work assignment for SUDS. (3) To achieve fewer bubbles in the tensor core's systolic pipeline due to the irregularity of unstructured sparsity, we propose offline systolic scheduling to group together the sparse filters with similar, statically-known execution times (based on the number of non-zeros). Our evaluation shows that Eureka achieves 4.8.. and 2.4.. speedups, and 3.1.. and 1.8.. energy reductions over dense and 2:4 sparse (Ampere) implementations, respectively, and incurs area and power overheads of 6% and 11.5%, respectively, over Ampere.
更多
查看译文
关键词
Tensor cores,deep neural network (DNN) inference. one-sided sparsity,unstructured sparsity
AI 理解论文
溯源树
样例
生成溯源树,研究论文发展脉络
Chat Paper
正在生成论文摘要