Tree Cache – A Novel Approach to Non-Uniform Access Latency Cache Architectures for 3D CMPs

The Pennsylvania State University, Tech. Rep. CSE-09-017(2009)

Abstract
We consider a non-uniform access latency cache architecture (NUCA) design for 3D chip multiprocessors (CMPs) in which cache structures are divided into small banks interconnected by a network-on-chip (NoC). In earlier NUCA designs, data is placed in banks either statically (S-NUCA) or dynamically (D-NUCA). In both S-NUCA and D-NUCA designs, scaling to hundreds of cores poses several challenges. In S-NUCA, bank contention can develop when a large number of application threads compete for the same bank. In D-NUCA, with both broadcast and sequential lookup schemes, all banks, all routers, and a significant portion of the NoC links are accessed on a cache miss (on a hit, the sequential scheme may access fewer banks). We propose a new NUCA architecture with an inclusive, tree-based, hierarchical directory (T-NUCA), with the potential to scale to hundreds of cores with performance comparable to D-NUCA at a fraction of the energy cost. We develop two T-NUCA implementations, T-NUCA-2 (binary) and T-NUCA-8 (octal), toward improved performance and energy trade-offs. We simulate these caches on a system with 64 cores and focus on both single-program and multi-program environments in which many applications compete for cache resources. Our evaluations indicate that in the single-program environment, the T-NUCA-8 cache can reduce execution time by up to 27% over S-NUCA for workloads that have low L2 cache miss rates. In a multi-program environment, where 8 different applications are mapped to 64 cores, T-NUCA-8 can yield significant performance and energy-delay product (EDP) benefits. Relative to S-NUCA, T-NUCA-8 improves performance and EDP by 70% and 25%, respectively, with an energy consumption increase of 150%. Relative to D-NUCA, T-NUCA-8 reduces network usage by 92%, energy by 87%, and EDP by 87%, at a performance cost of 10%. Finally, relative to a 24MB D-NUCA, our 16MB T-NUCA-2 has 7% better performance, with energy and EDP lower by factors of 8.57 and 9.17, respectively.
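The reported figures can be cross-checked with a small calculation. The sketch below assumes the usual definition of EDP as energy multiplied by delay (execution time); the abstract does not spell this out. It also assumes that the "70%" performance improvement over S-NUCA means a 70% reduction in execution time, and that "7% better performance" means a speedup of 1.07. The function name `edp_ratio` is hypothetical and used only for illustration.

```python
def edp_ratio(energy_ratio: float, delay_ratio: float) -> float:
    """Return EDP(new) / EDP(baseline), assuming EDP = energy * delay."""
    return energy_ratio * delay_ratio


# T-NUCA-8 vs. S-NUCA (multi-program case):
# energy increases by 150%  -> energy ratio 2.5
# assumed: 70% lower execution time -> delay ratio 0.3
vs_snuca = edp_ratio(2.5, 0.3)

# 16MB T-NUCA-2 vs. 24MB D-NUCA:
# energy lower by a factor of 8.57 -> energy ratio 1 / 8.57
# assumed: speedup of 1.07         -> delay ratio 1 / 1.07
vs_dnuca_factor = 1.0 / edp_ratio(1.0 / 8.57, 1.0 / 1.07)

print(f"EDP ratio vs. S-NUCA: {vs_snuca:.2f}")
print(f"EDP reduction factor vs. 24MB D-NUCA: {vs_dnuca_factor:.2f}")
```

Under these assumptions the numbers line up with the abstract: an EDP ratio of 0.75 corresponds to the stated 25% EDP improvement over S-NUCA, and 8.57 × 1.07 gives roughly the stated factor of 9.17 relative to the 24MB D-NUCA.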