Employing Multiple Levels of Parallelism for CFD at Large Scales on Next Generation High-Performance Computing Platforms

M. Howard, T. Fisher, M. Hoemmen, D. Dinzl, J. Overfelt, A. Bradley, K. Kim, S. Rajamanickam

Semantic Scholar (2018)

Abstract
Computational fluid dynamics (CFD) has traditionally been a significant beneficiary of advances in high-performance computing (HPC). The computational cost of simulating the dynamics of a fluid in motion is significant, and capturing the true physics of nearly any fluid flow of industrial relevance requires simulation on the most advanced supercomputers available. For at least the past 20 years, there were minimal changes in processor architectures between generations of supercomputers; for our purposes, we define a generation of a supercomputer to be 3 to 4 years. The status quo has been that each new generation brings a faster processor, as measured by higher clock frequencies or more cores per processor, but no fundamental changes in the processor architecture or in the programming models needed to employ the computing devices effectively. This status quo has gradually started to change as the major computing vendors have introduced new processor technologies targeted at HPC applications.

To illustrate this change, it is instructive to examine the processing technologies of the Advanced Technology System (ATS) supercomputers acquired by the U.S. National Nuclear Security Administration (NNSA) in recent years. The first of these machines, ATS-1 (named Trinity), has a traditional Xeon (Haswell) partition and a second partition based on Xeon Phi processors (codenamed Knights Landing, or KNL); it was put into production in 2017. The second of these machines, ATS-2 (named Sierra), is expected to deliver approximately 125 petaflop/s and is scheduled for production use in 2019. Sierra is a truly heterogeneous machine, combining IBM's Power9 processors with Nvidia Corporation's Volta GPUs, with a significant fraction of the total flop/s coming from the GPUs.

What is unique and disruptive about devices such as KNLs or GPUs is the trend of adding more parallelism to each device, through threads and/or vector units. The Xeon Phi KNL processors in Trinity have 68 processing cores, each with 4 hardware threads and multiple 512-bit vector units, and they require multiple levels of parallelism (vector operations as well as thread parallelism) to exploit as much as possible of the approximately 3 teraflop/s theoretical peak available per processor. Furthermore, KNL processors have two memory spaces: an on-package high-bandwidth memory (HBM) in addition to traditional DDR memory. The high-bandwidth memory provides approximately 480 GB/s of bandwidth, while the DDR memory provides approximately 90 GB/s. The Volta GPUs in Sierra have a theoretical peak of approximately 7 teraflop/s in double precision and 900 GB/s of memory bandwidth. Combined with the Power9 CPUs, a compute node of Sierra therefore has multiple execution spaces (CPUs and GPUs) and multiple memory spaces (host CPU memory and GPU memory).

Further complicating the architectural changes are the many programming models that have arisen. It is fair to say that the Message Passing Interface (MPI) has been the de facto standard for inter-node parallelism for almost 30 years. However, MPI itself does little to address intra-node parallelism, that is, the thread and vector parallelism needed to use Xeon Phi or GPU architectures effectively. To address intra-node (or on-node) parallelism, programming models such as OpenMP, CUDA, and OpenACC have emerged.
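The multiple levels of parallelism described above can be made concrete with a small example. The sketch below is not taken from the paper; it is a minimal illustration, assuming a standard MPI + OpenMP toolchain, of how a dot product might exploit all three levels on a Trinity-style node: MPI ranks across nodes, OpenMP threads within a node, and `omp simd` for the 512-bit vector units.

```cpp
// Hypothetical sketch (not the paper's code): three levels of parallelism.
// Level 1: MPI ranks across nodes; level 2: OpenMP threads within a node;
// level 3: SIMD lanes on each core's vector units.
#include <mpi.h>
#include <vector>
#include <cstdio>

int main(int argc, char** argv) {
  MPI_Init(&argc, &argv);                        // level 1: inter-node
  int rank = 0;
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);

  const int n = 1 << 20;                         // this rank's local portion
  std::vector<double> u(n, 1.0), v(n, 2.0);
  double local_dot = 0.0;

  // level 2: thread parallelism; level 3: vectorization within each thread.
  #pragma omp parallel for simd reduction(+ : local_dot)
  for (int i = 0; i < n; ++i) {
    local_dot += u[i] * v[i];
  }

  double global_dot = 0.0;                       // combine across all ranks
  MPI_Allreduce(&local_dot, &global_dot, 1, MPI_DOUBLE, MPI_SUM,
                MPI_COMM_WORLD);
  if (rank == 0) std::printf("dot = %g\n", global_dot);

  MPI_Finalize();
  return 0;
}
```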
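The two KNL memory spaces also change how applications allocate data. As a hedged illustration (the paper itself may manage memory differently, for example through a portability layer), the sketch below uses the memkind library's hbwmalloc interface, one common way to place a bandwidth-bound array in the on-package HBM and fall back to DDR when no HBM is present.

```cpp
// Hypothetical sketch: hot data in KNL's HBM (~480 GB/s) via memkind's
// hbwmalloc API, falling back to DDR (~90 GB/s). Link with -lmemkind.
#include <hbwmalloc.h>
#include <cstdlib>
#include <cstdio>

int main() {
  const size_t n = size_t(1) << 26;  // ~512 MB of doubles
  const bool have_hbw = (hbw_check_available() == 0);

  double* field = static_cast<double*>(
      have_hbw ? hbw_malloc(n * sizeof(double))
               : std::malloc(n * sizeof(double)));
  if (!field) return 1;

  for (size_t i = 0; i < n; ++i) field[i] = 0.0;  // stream through the field

  std::printf("field of %zu doubles placed in %s\n", n,
              have_hbw ? "HBM" : "DDR");
  if (have_hbw) hbw_free(field); else std::free(field);
  return 0;
}
```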
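Similarly, the separate execution and memory spaces of a Sierra-style node are visible directly in CUDA, one of the intra-node models the abstract names. The sketch below is an illustrative assumption, not the paper's implementation: device allocations live in the GPU's high-bandwidth memory, and the kernel runs in the GPU execution space while the host CPU orchestrates.

```cpp
// Hypothetical CUDA sketch: a heterogeneous node has separate execution
// spaces (host CPU, GPU) and separate memory spaces (host DDR, device HBM).
// Data must reside in device memory before the kernel runs. Build with nvcc.
#include <cuda_runtime.h>

__global__ void axpy(int n, double a, const double* x, double* y) {
  int i = blockIdx.x * blockDim.x + threadIdx.x;  // one GPU thread per entry
  if (i < n) y[i] = a * x[i] + y[i];
}

int main() {
  const int n = 1 << 20;
  double *x_d = nullptr, *y_d = nullptr;
  cudaMalloc(&x_d, n * sizeof(double));   // allocations in GPU memory space
  cudaMalloc(&y_d, n * sizeof(double));
  cudaMemset(x_d, 0, n * sizeof(double)); // zero-fill stands in for real data
  cudaMemset(y_d, 0, n * sizeof(double));

  const int block = 256;                  // GPU execution space: many threads
  axpy<<<(n + block - 1) / block, block>>>(n, 2.0, x_d, y_d);
  cudaDeviceSynchronize();                // host waits for the device

  cudaFree(x_d);
  cudaFree(y_d);
  return 0;
}
```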