Accelerating bandwidth-bound deep learning inference with main-memory accelerators


Abstract

Matrix-matrix multiplication operations (GEMMs) are important in many HPC and machine-learning applications. They are often mapped to discrete accelerators (e.g., GPUs) to improve performance. However, we find that large tall/skinny and fat/short matrices benefit little from discrete acceleration and also perform poorly on a CPU. Such matrices are prevalent in important workloads, such as deep-learning inference within large-scale datacenters. We demonstrate the large potential of accelerating these GEMMs with processing in the main CPU memory, where processing-in-memory units (PIMs) take advantage of otherwise untapped bandwidth without requiring data copies. We develop a novel GEMM execution flow and corresponding memory-side address-generation logic that exploits GEMM locality and enables long-running PIM kernels despite the complex address-mapping functions employed by the CPU. Our evaluation of StepStone variants at the channel, device, and within-device PIM levels demonstrates 12X better minimum latency than a CPU and 2.8X greater throughput under strict query-latency constraints. End-to-end performance analysis of recent recommendation and language models shows that StepStone outperforms a fast CPU by up to 16X and the best prior main-memory acceleration approaches by up to 2.4X.
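
To illustrate why tall/skinny GEMMs tend to be bandwidth-bound, the following minimal sketch compares the arithmetic intensity (FLOPs per byte moved) of a square GEMM with that of a skinny one. It is not taken from the paper: the function names, the matrix sizes, the 4-byte element size, and the machine figures (1 TFLOP/s peak compute, 100 GB/s memory bandwidth) are all assumptions chosen for illustration, and the byte count assumes each matrix is moved exactly once.

```python
def gemm_arithmetic_intensity(m, n, k, bytes_per_elem=4):
    """FLOPs per byte for C[m,n] = A[m,k] @ B[k,n].

    Assumes each of A, B, and C crosses the memory interface exactly once
    (an idealized lower bound on traffic, not a measured value).
    """
    flops = 2 * m * n * k  # one multiply and one add per inner-product term
    bytes_moved = bytes_per_elem * (m * k + k * n + m * n)
    return flops / bytes_moved


def is_bandwidth_bound(intensity, peak_flops, mem_bandwidth):
    """Simple roofline test: bound by memory if intensity is below machine balance."""
    machine_balance = peak_flops / mem_bandwidth  # FLOPs the machine can do per byte
    return intensity < machine_balance


# Hypothetical machine: 1 TFLOP/s peak, 100 GB/s memory bandwidth (assumed values).
PEAK_FLOPS = 1e12
MEM_BANDWIDTH = 1e11

# Square GEMM versus a skinny GEMM with a small batch dimension (n = 4).
square = gemm_arithmetic_intensity(4096, 4096, 4096)
skinny = gemm_arithmetic_intensity(4096, 4, 4096)

print(f"square: {square:.1f} FLOPs/byte, "
      f"bandwidth-bound: {is_bandwidth_bound(square, PEAK_FLOPS, MEM_BANDWIDTH)}")
print(f"skinny: {skinny:.2f} FLOPs/byte, "
      f"bandwidth-bound: {is_bandwidth_bound(skinny, PEAK_FLOPS, MEM_BANDWIDTH)}")
```

Under these assumptions the skinny GEMM performs roughly two FLOPs per byte, far below the assumed machine balance of ten, so its run time is governed by how fast the weight matrix can be streamed from memory rather than by compute throughput.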