ScaMP: Scalable Meta-Parallelism for Deep Learning Search

2023 IEEE/ACM 23rd International Symposium on Cluster, Cloud and Internet Computing (CCGrid)

Abstract
Deep Learning (DL) models are growing exponentially and require increasingly powerful High Performance Computing (HPC) systems to train them. Achieving state-of-the-art results requires carefully tuning the DL model architecture and training settings, a time-consuming process commonly relegated to distributed search frameworks and trial-and-error. However, existing search frameworks do not provide a flexible parallelism scheme, within and among candidates in the chosen DL framework, for modern out-of-core DL models. In this paper, we propose Scalable Meta-Parallelism for Deep Learning Search (ScaMP): a distributed Hyperparameter Optimization (HPO) and Neural Architecture Search (NAS) framework that supports out-of-core models with flexible parallelism schemes. ScaMP is integrated into the modern DL ecosystem and enables both efficient parallel training of concurrent candidate architectures and saturation of aggregate device memory via a powerful load-balancing engine. ScaMP estimates the memory requirements of each candidate architecture and automatically applies the appropriate model-parallel degree and the maximum batch size supported for that candidate. Further, HPO and NAS with ScaMP are highly customizable via flexible configuration options. We evaluate the benefits of our designs on synthetic training benchmarks and on training a state-of-the-art vision transformer model. We select transformers as the candidate DL model type and demonstrate a 29% improvement in end-to-end HPO time on 32 V100 GPUs on the Lassen and ThetaGPU HPC systems. We also demonstrate a reduction in the proportion of NAS time spent in communication from 28% to 15%. Finally, we thoroughly verify the correctness of ScaMP by training a state-of-the-art SwinIR model.
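The memory-aware scheduling the abstract describes (estimate a candidate's memory footprint, then pick a model-parallel degree and the largest batch size that fits) can be illustrated with a minimal sketch. The code below is an assumption-laden illustration, not ScaMP's actual API: the names `Candidate`, `estimate_static_memory`, and `choose_parallel_config`, as well as the byte-cost constants, are hypothetical.

```python
# Minimal sketch (hypothetical, not ScaMP's API): given a candidate architecture,
# estimate its memory footprint and pick a model-parallel degree plus the
# largest per-GPU batch size that fits in device memory.
from dataclasses import dataclass

BYTES_PER_PARAM = 2        # fp16 weights (assumption)
OPTIMIZER_MULTIPLIER = 8   # optimizer states + fp32 master copy (assumption)

@dataclass
class Candidate:
    num_params: int                    # total trainable parameters
    activation_bytes_per_sample: int   # rough per-sample activation cost

def estimate_static_memory(c: Candidate) -> int:
    """Memory for weights and optimizer state, independent of batch size."""
    return c.num_params * (BYTES_PER_PARAM + OPTIMIZER_MULTIPLIER)

def choose_parallel_config(c: Candidate, gpu_mem: int, max_gpus: int):
    """Pick the smallest model-parallel degree whose per-GPU static share fits,
    then the largest batch size that fits in the remaining memory."""
    for degree in (d for d in range(1, max_gpus + 1) if max_gpus % d == 0):
        static_per_gpu = estimate_static_memory(c) // degree
        if static_per_gpu >= gpu_mem:
            continue  # weights/optimizer alone overflow at this degree
        free = gpu_mem - static_per_gpu
        batch = free // c.activation_bytes_per_sample
        if batch >= 1:
            return degree, batch
    raise ValueError("candidate does not fit even at the maximum model-parallel degree")

# Example: a ~1.3B-parameter candidate on 16 GB GPUs, up to 8 GPUs per model
cand = Candidate(num_params=1_300_000_000, activation_bytes_per_sample=600_000_000)
print(choose_parallel_config(cand, gpu_mem=16 * 2**30, max_gpus=8))
```

Preferring the smallest model-parallel degree that fits is one plausible policy (it leaves more GPUs free for training other candidates concurrently); the paper's actual load-balancing engine may weigh these trade-offs differently.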
Keywords
Neural Networks, DNN, MPI, GPU