CloverLeaf on Intel Multi-Core CPUs: A Case Study in Write-Allocate Evasion.
CoRR (2023)
Abstract
In this paper we analyze the MPI-only version of the CloverLeaf code from the
SPEChpc 2021 benchmark suite on recent Intel Xeon "Ice Lake" and "Sapphire
Rapids" server CPUs. We observe peculiar breakdowns in performance when the
number of processes is prime. Investigating this effect, we create
first-principles data traffic models for each of the stencil-like hotspot
loops. With application measurements and microbenchmarks to study memory data
traffic behavior, we can connect the breakdowns to SpecI2M, a new
write-allocate evasion feature in current Intel CPUs. We identify conditions
under which SpecI2M works as intended and where it fails to avoid
write-allocate transfers. Write-allocate evasion works best if large arrays are
written consecutively; in the CloverLeaf code, non-temporal stores can be
employed on top for best results. For serial and full-node cases we are able to
predict the memory data volume analytically with an error of a few percent. We
find that if the number of processes is prime, SpecI2M fails to work properly,
which we can attribute to short inner loops emerging from the one-dimensional
domain decomposition in this case. We can also rule out other possible causes
of the prime-number effect, such as breaking layer conditions, MPI
communication overhead, and load imbalance.
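
The link between a prime process count and short inner loops can be illustrated with a small sketch. It is not CloverLeaf's actual decomposition routine: the function names, the "closest to square" rule, and the 15360-cell grid width are assumptions chosen for illustration, and the prime case is assumed to cut along the inner (fastest-running) dimension, as the short inner loops described above imply.

```python
import math


def decompose(ranks: int) -> tuple[int, int]:
    """Return (px, py) with px * py == ranks, as close to square as possible.

    Hypothetical rule: px counts cuts along the inner dimension and takes
    the larger factor. A prime rank count only factors as ranks x 1, so
    the decomposition degenerates to one dimension.
    """
    best = (ranks, 1)
    for py in range(1, math.isqrt(ranks) + 1):
        if ranks % py == 0:
            best = (ranks // py, py)  # later hits are closer to square
    return best


def inner_loop_length(nx: int, ranks: int) -> int:
    """Largest per-rank inner-loop trip count for a grid that is nx cells wide."""
    px, _ = decompose(ranks)
    return -(-nx // px)  # ceiling division


if __name__ == "__main__":
    nx = 15360  # assumed grid width, for illustration only
    for ranks in (71, 72):
        px, py = decompose(ranks)
        print(f"{ranks} ranks -> {px} x {py}, inner loop of about "
              f"{inner_loop_length(nx, ranks)} cells")
```

Under these assumptions, 72 processes form a 9 x 8 grid with inner loops of about 1707 cells, whereas 71 processes form a 71 x 1 strip layout with inner loops of only about 217 cells, so each process writes much shorter consecutive runs.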
Keywords
multi-core, write-allocate