SU3_Bench on a Programmable Integrated Unified Memory Architecture (PIUMA) and How that Differs from Standard NUMA CPUs

High Performance Computing, ISC High Performance 2022 (2022)

Abstract
SU3_Bench explores performance portability across multiple programming models using a simple but nontrivial mathematical kernel. The kernel is derived from Lattice Quantum Chromodynamics (LQCD) code used in applications such as hadron physics, and hence should be of interest to the scientific community. SU3_Bench has a regular compute and data access pattern, and on most traditional CPU- and GPU-based systems its performance is determined mainly by the achievable memory bandwidth. However, this paper shows that on the new Intel Programmable Integrated Unified Memory Architecture (PIUMA), which is designed for sparse workloads and has a balanced flops-to-byte ratio with scalar cores, SU3_Bench's performance is determined by the total number of instructions that can be executed per cycle (pipeline throughput) rather than by the usual bandwidth or flops. We present the performance analysis, porting, and optimization of SU3_Bench on the PIUMA architecture and discuss how these differ from those on standard NUMA CPUs (e.g., Xeon required NUMA optimizations, whereas PIUMA did not). We show iso-bandwidth and iso-power comparisons of SU3_Bench on PIUMA versus Xeon. We also show performance efficiency comparisons of SU3_Bench on PIUMA, Xeon, GPUs, and FPGAs based on pre-existing data. The lessons learned are generalizable to other similar kernels.
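The bandwidth-bound behaviour described in the abstract can be illustrated with a back-of-the-envelope calculation. The sketch below is not taken from the SU3_Bench source. It assumes the core operation is a 3x3 complex matrix product, counts a complex multiply as 6 real flops and a complex add as 2, and assumes both inputs and the output are streamed from memory with no cache reuse. The names `su3_mat_mul` and `arithmetic_intensity` are hypothetical.

```python
# Illustrative sketch only: not the SU3_Bench implementation.

def su3_mat_mul(a, b):
    """Multiply two 3x3 complex matrices given as nested lists."""
    c = [[0j] * 3 for _ in range(3)]
    for i in range(3):
        for j in range(3):
            for k in range(3):
                # One complex multiply-accumulate per (i, j, k) triple.
                c[i][j] += a[i][k] * b[k][j]
    return c


# Assumed flop-counting convention: complex multiply = 6, complex add = 2.
CMUL_FLOPS = 6
CADD_FLOPS = 2
MACS_PER_MATMUL = 3 * 3 * 3  # 27 multiply-accumulates
FLOPS_PER_MATMUL = MACS_PER_MATMUL * (CMUL_FLOPS + CADD_FLOPS)


def arithmetic_intensity(bytes_per_real):
    """Flops per byte moved, assuming a, b are read and c is written once."""
    matrices = 3            # a, b, c
    reals_per_matrix = 9 * 2  # 9 complex entries, 2 reals each
    bytes_moved = matrices * reals_per_matrix * bytes_per_real
    return FLOPS_PER_MATMUL / bytes_moved


if __name__ == "__main__":
    print("flops per 3x3 complex product:", FLOPS_PER_MATMUL)
    print("flops/byte, single precision:", arithmetic_intensity(4))
    print("flops/byte, double precision:", arithmetic_intensity(8))
```

Under these assumptions the kernel performs on the order of one flop per byte moved. A machine whose peak flops-to-byte ratio is well above that figure would be limited by memory bandwidth, while a machine with a balanced ratio could instead be limited by how many instructions its cores issue per cycle, which is the distinction the abstract draws.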
Keywords
SU3_Bench, SU3, LQCD, QCD, PIUMA, Quantum chromodynamics