Accelerating Heterogeneous Tensor Parallelism via Flexible Workload Control
CoRR (2024)
Abstract
Transformer-based models have recently become deeper and larger. For better scalability, a common training solution in industry is to split billions of parameters (tensors) into many tasks and run them across homogeneous accelerators (e.g., GPUs). However, such a dedicated compute cluster is prohibitively expensive for academia and mid-sized companies. An economical alternative is to aggregate existing heterogeneous devices and share resources among multiple tenants. Nevertheless, static hardware disparities and dynamic resource contention inevitably cause straggling tasks, which heavily degrade overall training efficiency. Existing works offer contributions mainly tailored to traditional data parallelism; they cannot work well for the newer tensor parallelism because of its strict communication and correctness constraints. In this paper we first present ZERO-resizing, a novel dynamic workload balancing technique that requires no data migration. It tunes workloads in real time by temporarily resizing the matrices involved in core tensor computations. We specifically design data imputation and priority selection policies to, respectively, satisfy the consistency constraints required by normal training and reduce accuracy loss. We also present a lightweight data migration technique without loss of accuracy, to cope with heavy heterogeneity. Our final SEMI-migration solution is built on top of these two techniques and adaptively assigns them their respective balancing missions, achieving overall success in both efficiency and accuracy. Extensive experiments on the representative Colossal-AI platform validate the effectiveness of our proposals.
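To make the resizing idea concrete, below is a minimal PyTorch sketch of what a temporarily resized tensor computation with priority selection and imputation could look like. It is an illustration inferred only from the abstract, not the paper's actual implementation: the names (`priority_select`, `zero_resize_matmul`, `semi_balance`), the column-norm priority heuristic, the zero imputation, and the `slowdown` threshold are all assumptions.

```python
import torch

def priority_select(weight: torch.Tensor, keep_ratio: float) -> torch.Tensor:
    """Pick high-priority columns to keep; column L2 norm is an assumed
    importance proxy intended to limit accuracy loss."""
    k = max(1, int(weight.shape[1] * keep_ratio))
    scores = weight.norm(dim=0)              # per-column importance score
    return torch.topk(scores, k).indices

def zero_resize_matmul(x: torch.Tensor, weight: torch.Tensor,
                       keep_ratio: float) -> torch.Tensor:
    """Run x @ weight with temporarily resized operands on a straggler.

    Only the selected columns join the GEMM; the dropped output columns
    are imputed (here with zeros) so the result keeps its full shape and
    downstream collectives stay shape-consistent."""
    if keep_ratio >= 1.0:                    # fast path: no straggling
        return x @ weight
    idx = priority_select(weight, keep_ratio)
    partial = x @ weight[:, idx]             # smaller GEMM, less compute
    out = x.new_zeros(x.shape[0], weight.shape[1])
    out[:, idx] = partial                    # imputation restores the shape
    return out

def semi_balance(x: torch.Tensor, weight: torch.Tensor,
                 slowdown: float, threshold: float = 0.5) -> torch.Tensor:
    """Assumed SEMI-migration-style dispatch: absorb mild straggling by
    resizing; under heavy heterogeneity a real system would instead
    migrate part of the workload to faster devices (elided here)."""
    if slowdown < threshold:
        return zero_resize_matmul(x, weight, keep_ratio=1.0 - slowdown)
    raise NotImplementedError("migration path not sketched here")
```

The split mirrors the abstract's division of labor: resizing absorbs transient, mild imbalance without moving any data, while the heavier, accuracy-preserving migration path is reserved for persistent, heavy heterogeneity.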