Scaling and analyzing the stencil performance on multi-core and many-core architectures

Lin Gan, Haohuan Fu, Wei Xue, Yangtong Xu, Chao Yang, Xinliang Wang, Zihong Lv, Yang You, Guangwen Yang, Kaijian Ou

ICPADS (2014)

Citations: 16 | Views: 72

Abstract

Stencils are among the most important and time-consuming kernels in many applications. While stencil optimization is a well-studied topic on CPU platforms, achieving higher performance and efficiency for evolving numerical stencils on more recent multi-core and many-core architectures remains an important issue. In this paper, we explore a number of different stencils, ranging from a basic 7-point Jacobi stencil to more complex high-order stencils used in finer numerical simulations. By optimizing and analyzing those stencils on the latest multi-core and many-core architectures (the Intel Sandy Bridge processor, the Intel Xeon Phi coprocessor, and the NVIDIA Fermi C2070 and Kepler K20x GPUs), we investigate the algorithmic and architectural factors that determine the performance and efficiency of the resulting designs. While multi-threading, vectorization, and optimization for cache and other fast buffers are still the most important techniques for improving performance, we observe that differences in the memory hierarchy and in the mechanisms for issuing and executing parallel instructions lead to different performance behaviors on CPU, MIC, and GPU. With vector-like processing units becoming the major provider of computing power on almost all architectures, the compiler's inability to align all computing and memory operations would become the major obstacle to achieving high efficiency on current and future platforms. Our specific optimization of the complex WNAD stencil on GPU provides a good example of what the compiler could do to help.
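For orientation, the basic 7-point Jacobi stencil mentioned in the abstract can be sketched roughly as follows. This is a plain-Python illustration of the access pattern only, not the paper's optimized code; the names `jacobi7_step` and `make_grid` and the coefficients `alpha` and `beta` are assumptions introduced here.

```python
def make_grid(n, boundary=1.0, interior=0.0):
    """Build an n x n x n grid (nested lists) with fixed boundary values.

    Hypothetical helper for this illustration only.
    """
    def on_edge(i, j, k):
        return 0 in (i, j, k) or (n - 1) in (i, j, k)

    return [
        [
            [boundary if on_edge(i, j, k) else interior for k in range(n)]
            for j in range(n)
        ]
        for i in range(n)
    ]


def jacobi7_step(u, alpha=0.0, beta=1.0 / 6.0):
    """One Jacobi sweep: each interior point is updated from itself
    and its six face neighbors (7 points in total).

    alpha and beta are assumed weights; the defaults average the
    six neighbors.
    """
    nx, ny, nz = len(u), len(u[0]), len(u[0][0])
    # Jacobi reads only the old grid and writes into a separate copy,
    # so every point update is independent of the others.
    new = [[row[:] for row in plane] for plane in u]
    for i in range(1, nx - 1):
        for j in range(1, ny - 1):
            for k in range(1, nz - 1):
                new[i][j][k] = alpha * u[i][j][k] + beta * (
                    u[i - 1][j][k] + u[i + 1][j][k]
                    + u[i][j - 1][k] + u[i][j + 1][k]
                    + u[i][j][k - 1] + u[i][j][k + 1]
                )
    return new
```

Because each output point depends only on the old grid, the three loops can be split across threads and the innermost loop vectorized, which is why memory layout and alignment matter so much for this kernel.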

Keywords

graphics processing unit, multi-core architecture, WNAD stencil, compiler, multicore architecture, vectorization, graphics processing units, many-core architecture, stencil performance, NVIDIA Fermi C2070 GPU, optimizations, multiprocessing systems, Intel Sandy Bridge processor, Kepler K20x GPU, optimization, numerical simulation, multithreading, stencil, Intel Xeon Phi coprocessor, program compilers, seven-point Jacobi stencil
