C 3 D: Mitigating the NUMA bottleneck via coherent DRAM caches

international symposium on microarchitecture(2016)

引用 0|浏览0
暂无评分
摘要
Massive datasets prevalent in scale-out, enterprise, and high-performance computing are driving a trend toward ever-larger memory capacities per node. To satisfy the memory demands and maximize performance per unit cost, today's commodity HPC and server nodes tend to feature multi-socket shared memory NUMA organizations. An important problem in these designs is the high latency of accessing memory on a remote socket that results in degraded performance in workloads with large shared data working sets. This work shows that emerging DRAM caches can help mitigate the NUMA bottleneck by filtering up to 98% of remote memory accesses. To be effective, these DRAM caches must be private to each socket to allow caching of remote memory, which comes with the challenge of ensuring coherence across multiple sockets and GBs of DRAM cache capacity. Moreover, the high access latency of DRAM caches, combined with high inter-socket communication latencies, can make hits to remote DRAM caches slower than main memory accesses. These features challenge existing coherence protocols optimized for on-chip caches with fast hits and modest storage capacity. Our solution to these challenges relies on two insights. First, keeping DRAM caches clean avoids the need to ever access a remote DRAM cache on a read. Second, a non-inclusive on-chip directory that avoids tracking blocks in the DRAM cache enables a light-weight protocol for guaranteeing coherence without the staggering directory costs. Our design, called Clean Coherent DRAM Caches (C3D), leverages these insights to improve performance by 6.4–50.7% in a quad-socket system versus a baseline without DRAM caches.
更多
查看译文
AI 理解论文
溯源树
样例
生成溯源树,研究论文发展脉络
Chat Paper
正在生成论文摘要