Tera-scale 1D FFT with low-communication algorithm and Intel® Xeon Phi™ coprocessors.

SC(2013)

Abstract
This paper demonstrates the first tera-scale performance of Intel® Xeon Phi™ coprocessors on 1D FFT computations. Applying a disciplined performance programming methodology of sound algorithm choice, valid performance model, and well-executed optimizations, we break the tera-flop mark on a mere 64 nodes of Xeon Phi and reach 6.7 TFLOPS with 512 nodes, which is 1.5x what is achievable on the same number of Intel® Xeon® nodes. It is a challenge to fully utilize the compute capability presented by many-core wide-vector processors for bandwidth-bound FFT computation. We leverage a new algorithm, Segment-of-Interest FFT, with low inter-node communication cost, and aggressively optimize data movements in node-local computations, exploiting caches. Our coordination of a low-communication algorithm and a massively parallel architecture for scalable performance is not limited to running FFT on Xeon Phi; it can serve as a reference for other bandwidth-bound computations and for emerging HPC systems that are increasingly communication limited.
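To put the quoted throughput figures in context, the sketch below works through the arithmetic. It assumes the widely used benchmarking convention of counting 5·N·log2(N) floating-point operations for a complex 1D FFT of length N; the abstract does not say which flop count the authors use, so treat that formula as an assumption. The function names are hypothetical and chosen only for illustration. The 64-node figure is a lower bound, since the abstract states only that the tera-flop mark is broken.

```python
import math


def fft_flops(n: int) -> float:
    """Nominal flop count for a complex 1D FFT of length n.

    Assumption: the common 5 * n * log2(n) convention, not confirmed by the abstract.
    """
    return 5.0 * n * math.log2(n)


def fft_tflops(n: int, seconds: float) -> float:
    """Throughput in TFLOPS for one length-n FFT completed in `seconds`."""
    return fft_flops(n) / seconds / 1e12


def per_node_gflops(total_tflops: float, nodes: int) -> float:
    """Average per-node throughput in GFLOPS."""
    return total_tflops * 1e3 / nodes


# Figures quoted in the abstract.
gflops_512 = per_node_gflops(6.7, 512)  # average per node at 512 nodes
gflops_64 = per_node_gflops(1.0, 64)    # lower bound per node at 64 nodes

print(f"512 nodes: {gflops_512:.1f} GFLOPS per node")
print(f" 64 nodes: at least {gflops_64:.1f} GFLOPS per node")
```

Under these assumptions the per-node rate is roughly 13 GFLOPS at 512 nodes against at least 15.6 GFLOPS at 64 nodes, which is the kind of gap one would expect from a computation limited by bandwidth and communication rather than by arithmetic.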
Keywords
coprocessors, fast Fourier transforms, multiprocessing systems, parallel architectures, HPC systems, Intel Xeon Phi coprocessors, TFLOPS, bandwidth-bound FFT computation, data movement optimization, disciplined performance programming methodology, low inter-node communication cost, low-communication algorithm, many-core wide-vector processors, node-local computations, segment-of-interest FFT, tera-scale 1D FFT, tera-scale performance, Bandwidth Optimizations, Communication-Avoiding Algorithms, FFT, Wide-Vector Many-Core Processors, Xeon Phi