OSCAR: Orchestrating STT-RAM cache traffic for heterogeneous CPU-GPU architectures.

MICRO-49: The 49th Annual IEEE/ACM International Symposium on Microarchitecture Taipei Taiwan October, 2016(2016)

引用 60|浏览83
暂无评分
摘要
As we integrate data-parallel GPUs with general-purpose CPUs on a single chip, the enormous cache traffic generated by GPUs will not only exhaust the limited cache capacity, but also severely interfere with CPU requests. Such heterogeneous multicores pose significant challenges to the design of shared last-level cache (LLC). This problem can be mitigated by replacing SRAM LLC with emerging non-volatile memories like Spin-Transfer Torque RAM (STT-RAM), which provides larger cache capacity and near-zero leakage power. However, without careful design, the slow write operations of STT-RAM may offset the capacity benefit, and the system may still suffer from contention in the shared LLC and on-chip interconnects. While there are cache optimization techniques to alleviate such problems, we reveal that the true potential of STT-RAM LLC may still be limited because now that the cache hit rate has been improved by the increased capacity, the on-chip network can become a performance bottleneck. CPU and GPU packets contend with each other for the shared network bandwidth. Moreover, the mixed-criticality read/write packets to STT-RAM add another layer of complexity to the network resource allocation. Therefore, being aware of the disparate latency tolerance of CPU/GPU applications and the asymmetric read/write latency of STT-RAM, we propose OSCAR to Orchestrate STT-RAM Caches traffic for heterogeneous ARchitectures. Specifically, an integration of asynchronous batch scheduling and priority based allocation for on-chip interconnect is proposed to maximize the potential of STT-RAM based LLC. Simulation results on a 28-GPU and 14-CPU system demonstrate an average of 17.4% performance improvement for CPUs, 10.8% performance improvement for GPUs, and 28.9% LLC energy saving compared to SRAM based LLC design.
更多
查看译文
关键词
OSCAR,STT-RAM cache traffic,heterogeneous CPU-GPU architectures,shared last-level cache,spin-transfer torque RAM,near-zero leakage power,heterogeneous multicores,shared network bandwidth,network resource allocation
AI 理解论文
溯源树
样例
生成溯源树,研究论文发展脉络
Chat Paper
正在生成论文摘要