Newton: A DRAM-maker’s Accelerator-in-Memory (AiM) Architecture for Machine Learning

2020 53rd Annual IEEE/ACM International Symposium on Microarchitecture (MICRO)(2020)

引用 100|浏览43
暂无评分
摘要
Advances in machine learning (ML) have ignited hardware innovations for efficient execution of the ML models many of which are memory-bound (e.g., long short-term memories, multi-level perceptrons, and recurrent neural networks). Specifically, inference using these ML models with small batches, as would be the case at the Cloud edge, has little reuse of the large filters and is deeply memory-bound. Simultaneously, processing-in or -near memory (PIM or PNM) is promising unprecedented high-bandwidth connection between compute and memory. Fortunately, the memory-bound ML models are a good fit for PIM. We focus on digital PIM which provides higher bandwidth than PNM and does not incur the reliability issues of analog PIM. Previous PIM and PNM approaches advocate full processor cores which do not conform to PIM’s severe area and power constraints. We describe Newton, a major DRAM maker’s upcoming accelerator-in-memory (AiM) product for machine learning, which makes the following contributions: (1) To satisfy PIM’s area constraints, Newton (a) places a minimal compute of only multiply-accumulate units and buffers in the DRAM which avoids the full-core area and power overheads of previous work and thus makes PIM feasible for the first time, and (b) employs a DRAM-like interface for the host to issue commands to the PIM compute. The PIM compute is rate-matched to the internal DRAM bandwidth and employs a non-intuitive, global input vector buffer shared by the entire channel to capture input reuse while amortizing buffer area cost. To the host, Newton’s interface is indistinguishable from regular DRAM without any offloading overheads and PIM/non-PIM mode switching, and with the same deterministic latencies even for floating-point commands. (2) To prevent the PIM-host interface from becoming a bottleneck, we include three optimizations: commands which gang multiple compute operations both within a bank and across banks; complex, multi-step compute commands – both of which save critical command bandwidth; and targeted reduction of t FAW overhead. (3) To capture output vector reuse with reasonable buffering, Newton employs an unusually-wide interleaved layout for the matrix. Our simulations running state-of-the-art neural networks show that building on a realistic HBM2E-like DRAM, Newton achieves 10x and 54x average speedup over a non-PIM system with infinite compute that perfectly uses the external DRAM bandwidth and a realistic GPU, respectively.
更多
查看译文
关键词
processing-in-memory,fully-connected neural networks,DRAM
AI 理解论文
溯源树
样例
生成溯源树,研究论文发展脉络
Chat Paper
正在生成论文摘要