Benchmarking and Dissecting the Nvidia Hopper GPU Architecture
CoRR (2024)
Abstract
Graphics processing units (GPUs) are continually evolving to cater to the
computational demands of contemporary general-purpose workloads, particularly
those driven by artificial intelligence (AI) utilizing deep learning
techniques. A substantial body of studies has been dedicated to dissecting the
microarchitectural metrics of diverse GPU generations, helping researchers
understand hardware details and leverage them to optimize GPU programs.
However, the latest Hopper GPUs present a set of novel attributes, including
new tensor cores supporting FP8, DPX instructions, and distributed shared
memory, whose performance and operational characteristics remain largely
undocumented. In this research, we propose an extensive
benchmarking study focused on the Hopper GPU. The objective is to unveil its
microarchitectural intricacies through an examination of the new
instruction-set architecture (ISA) of Nvidia GPUs and the utilization of new
CUDA APIs. Our approach involves two main aspects. Firstly, we conduct
conventional latency and throughput comparison benchmarks across the three most
recent GPU architectures, namely Hopper, Ada, and Ampere. Secondly, we delve
into a comprehensive discussion and benchmarking of the latest Hopper features,
encompassing the Hopper DPX dynamic programming (DP) instruction set,
distributed shared memory, and the availability of FP8 tensor cores. The
microbenchmarking results we present offer a deeper understanding of the novel
GPU AI function units and programming features introduced by the Hopper
architecture. This newfound understanding is expected to greatly facilitate
software optimization and modeling efforts for GPU architectures. To the best
of our knowledge, this study makes the first attempt to demystify the tensor
core performance and programming instruction sets unique to Hopper GPUs.
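To make the scope of the DPX benchmarks more concrete, the following is a minimal CUDA sketch (not taken from the paper) that exercises two of the Hopper DPX intrinsics exposed by recent CUDA toolkits, __vimax3_s32 (three-operand max) and __viaddmax_s32 (fused add-then-max), in a toy dynamic-programming-style relaxation step. The kernel name, data layout, and parameters are illustrative assumptions; on pre-Hopper architectures these intrinsics fall back to a software sequence rather than a single instruction.

// Hypothetical sketch of Hopper DPX intrinsics (compile with: nvcc -arch=sm_90 dpx_sketch.cu)
#include <cstdio>
#include <cuda_runtime.h>

__global__ void dpx_relax(const int* cost, int* best, int n, int edge_w)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;

    int left  = (i > 0)     ? cost[i - 1] : 0;
    int right = (i < n - 1) ? cost[i + 1] : 0;

    // Three-way max, mapped to a single DPX instruction on sm_90.
    int local_max = __vimax3_s32(left, right, cost[i]);

    // max(local_max + edge_w, best[i]): a typical DP relaxation step,
    // also a single DPX instruction on Hopper.
    best[i] = __viaddmax_s32(local_max, edge_w, best[i]);
}

int main()
{
    const int n = 1 << 10;
    int *cost, *best;
    cudaMallocManaged(&cost, n * sizeof(int));
    cudaMallocManaged(&best, n * sizeof(int));
    for (int i = 0; i < n; ++i) { cost[i] = i % 37; best[i] = 0; }

    dpx_relax<<<(n + 255) / 256, 256>>>(cost, best, n, /*edge_w=*/3);
    cudaDeviceSynchronize();

    printf("best[10] = %d\n", best[10]);
    cudaFree(cost);
    cudaFree(best);
    return 0;
}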