Characterization of Large Language Model Development in the Datacenter
arXiv (2024)
Abstract
Large Language Models (LLMs) have demonstrated impressive performance across
several transformative tasks. However, efficiently utilizing large-scale cluster
resources to develop LLMs is non-trivial and is often riddled with numerous
challenges such as frequent hardware failures, intricate parallelization
strategies, and imbalanced resource utilization. In this paper, we present an
in-depth characterization study of a six-month LLM development workload trace
collected from our GPU datacenter Acme. Specifically, we investigate
discrepancies between LLMs and prior task-specific Deep Learning (DL)
workloads, explore resource utilization patterns, and identify the impact of
various job failures. Our analysis summarizes hurdles we encountered and
uncovers potential opportunities to optimize systems tailored for LLMs.
Furthermore, we introduce our system efforts: (1) fault-tolerant pretraining,
which enhances fault tolerance through LLM-involved failure diagnosis and
automatic recovery; and (2) decoupled scheduling for evaluation, which achieves
timely performance feedback via trial decomposition and scheduling
optimization.