Saturn: An Optimized Data System for Multi-Large-Model Deep Learning Workloads.

Proc. VLDB Endow. (2023)

Abstract
Large models such as GPT-3 and ChatGPT have transformed deep learning (DL), powering applications that have captured the public's imagination. Such models must be trained on multiple GPUs due to their size and computational load, driving the development of a bevy of "model parallelism" techniques and tools. Navigating such parallelism choices, however, is a new burden for DL users such as data scientists, domain scientists, etc., who may lack the necessary systems know-how. The need for model selection, which leads to many models to train due to hyper-parameter tuning or layer-wise finetuning, compounds the situation with two more burdens: resource apportioning and scheduling. In this work, we unify these three burdens by formalizing them as a joint problem that we call SPASE: Select a Parallelism, Allocate resources, and Schedule. We propose a new information system architecture to tackle the SPASE problem holistically, exploiting the performance opportunities presented by joint optimization. We devise an extensible template for existing parallelism schemes and combine it with an automated empirical profiler for runtime estimation. We then formulate SPASE as an MILP. We find that direct use of an MILP solver is significantly more effective than several baseline heuristics. We optimize the system runtime further with an introspective scheduling approach. We implement all these techniques into a new data system we call Saturn. Experiments with benchmark DL workloads show that Saturn achieves 39-49% lower model selection runtimes than current DL practice.
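The abstract states that SPASE is formulated as an MILP but gives no details of that formulation. The sketch below is a minimal, hypothetical illustration of what a simplified SPASE-style MILP could look like in PuLP: each model picks exactly one (parallelism scheme, GPU count) pair from a profiled runtime table, and the makespan is bounded from below by aggregate GPU-time over cluster capacity. The runtime numbers, scheme names, and the surrogate capacity constraint are illustrative assumptions, not Saturn's actual formulation.

```python
# Illustrative sketch only: a simplified MILP in the spirit of SPASE
# (Select a Parallelism, Allocate resources, Schedule).
# Runtimes, scheme names, and constraints are hypothetical; Saturn's
# real formulation and scheduler are described in the paper.
import pulp

# Hypothetical profiled runtimes (hours): runtime[model][(scheme, num_gpus)]
runtime = {
    "model_a": {("fsdp", 2): 10.0, ("fsdp", 4): 6.0, ("pipeline", 4): 7.0},
    "model_b": {("fsdp", 4): 14.0, ("fsdp", 8): 9.0, ("pipeline", 8): 10.0},
}
TOTAL_GPUS = 8

prob = pulp.LpProblem("spase_sketch", pulp.LpMinimize)

# x[m, (scheme, gpus)] = 1 if model m trains with that scheme and GPU count
x = {
    (m, cfg): pulp.LpVariable(f"x_{m}_{cfg[0]}_{cfg[1]}", cat="Binary")
    for m, cfgs in runtime.items() for cfg in cfgs
}
makespan = pulp.LpVariable("makespan", lowBound=0)

prob += makespan  # objective: minimize overall model selection runtime

# Each model picks exactly one (parallelism, GPU-count) configuration.
for m, cfgs in runtime.items():
    prob += pulp.lpSum(x[m, cfg] for cfg in cfgs) == 1

# Aggregate GPU-time cannot exceed cluster capacity times the makespan
# (a lower-bound surrogate for full scheduling constraints).
prob += pulp.lpSum(
    cfg[1] * runtime[m][cfg] * x[m, cfg]
    for m, cfgs in runtime.items() for cfg in cfgs
) <= TOTAL_GPUS * makespan

# No single job can run longer than the makespan.
for m, cfgs in runtime.items():
    for cfg in cfgs:
        prob += runtime[m][cfg] * x[m, cfg] <= makespan

prob.solve(pulp.PULP_CBC_CMD(msg=False))
for (m, cfg), var in x.items():
    if var.value() is not None and var.value() > 0.5:
        print(m, "->", cfg, f"{runtime[m][cfg]}h")
print("makespan lower bound:", makespan.value())
```

A real scheduler must also decide when each job runs and avoid GPU oversubscription at every point in time; the paper's introspective scheduling approach addresses that, whereas this sketch only captures the selection and allocation choices.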