A Fast Tridiagonal Solver for Intel MIC Architecture

2016 IEEE International Parallel and Distributed Processing Symposium (IPDPS)(2016)

引用 12|浏览37
暂无评分
摘要
The tridiagonal solver is an important kernel in many scientific and engineering applications. Although quite a few parallel algorithms have been exploited recently, challenges still remain when solving tridiagonal systems on many-core architectures. In this paper, quantitative analysis is conducted to guide the selection of algorithms on different architectures, and a fast register-PCR-pThomas algorithm for Intel MIC architecture is presented to achieve the best utilization of in-core vectorization and registers. By choosing the most suitable number of PCR reductions, we further propose an improved algorithm, named register-PCR-half-pThomas algorithm, which minimizes the computational cost and the number of registers for use. The optimizations of manual prefetching and strength reduction are also discussed for the new algorithm. Evaluation on Intel Xeon Phi 7110P shows that our register-PCR-pThomas can outperform the MKL solver by 7.7 times in single precision and 3.4 times in double precision. Moreover, our optimized register-PCR-half-pThomas can further boost the performance by another 43.9% in single precision and 20.4% in double precision. Our best algorithm can outperform the latest official GPU library (cuSPARSE) on Nvidia K40 by 3.7 times and 2.4 times in single and double precision, respectively.
更多
查看译文
关键词
Tridiagonal solver,Intel MIC architecture,Vectorization
AI 理解论文
溯源树
样例
生成溯源树,研究论文发展脉络
Chat Paper
正在生成论文摘要