MAD-Max Beyond Single-Node: Enabling Large Machine Learning Model Acceleration on Distributed Systems
In Proceedings of the 2024 ACM/IEEE 51st Annual International Symposium on Computer Architecture (ISCA 2024)
Harvard University
Abstract
Training and deploying large machine learning (ML) models is time-consuming and requires significant distributed computing infrastructure. Based on real-world large-model training on datacenter-scale infrastructure, we show that 14-32% of all GPU hours are spent on communication with no overlapping computation. To minimize this outstanding communication latency, we develop an agile performance modeling framework to guide parallelization and hardware-software co-design strategies. Using a suite of real-world large ML models on state-of-the-art GPU training hardware, we demonstrate throughput improvement potential of 2.24x for pre-training and 5.27x for inference scenarios.
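The 14-32% figure refers to communication time that is not hidden behind computation. As a rough illustration of the kind of analytical estimate such a performance model produces (a minimal sketch, not the paper's actual MAD-Max framework; all class names, parameters, and numbers below are hypothetical), one can model a training step as compute time plus the portion of gradient all-reduce time that cannot be overlapped with it:

```python
# Minimal sketch (not the authors' MAD-Max framework): an analytical
# per-step model where "exposed" communication is the share of the
# all-reduce time that cannot be hidden behind backward compute.
from dataclasses import dataclass


@dataclass
class StepModel:
    compute_s: float      # forward + backward compute time per step (s)
    grad_bytes: float     # gradient volume exchanged per step (bytes)
    bus_bw: float         # effective all-reduce bandwidth (bytes/s)
    overlap_frac: float   # fraction of comm hideable under compute (0..1)

    def comm_s(self) -> float:
        # Total all-reduce time under a simple bandwidth-only model.
        return self.grad_bytes / self.bus_bw

    def exposed_comm_s(self) -> float:
        # Hidden communication is capped by the available compute time.
        hidden = min(self.comm_s() * self.overlap_frac, self.compute_s)
        return self.comm_s() - hidden

    def step_s(self) -> float:
        return self.compute_s + self.exposed_comm_s()


# Hypothetical numbers for illustration only.
m = StepModel(compute_s=0.80, grad_bytes=25e9, bus_bw=150e9, overlap_frac=0.6)
print(f"comm: {m.comm_s():.3f}s, exposed: {m.exposed_comm_s():.3f}s, "
      f"share of step: {m.exposed_comm_s() / m.step_s():.1%}")
```

Sweeping such a model over parallelization strategies and hardware configurations (bandwidth, overlap capability) is how a framework of this kind can rank co-design options without running full-scale training jobs.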
Key words
Distributed Training, Distributed Inference, Machine Learning, Data Center, GPU, Performance Model, Simulator, Parallelization, Hardware-Software Co-Design