SiloD: A Co-design of Caching and Scheduling for Deep Learning Clusters

Hanyu Zhao, Zhenhua Han, Zhi Yang, Quanlu Zhang, Mingxia Li, Fan Yang, Qianxi Zhang, Binyang Li, Yuqing Yang, Lili Qiu, Lintao Zhang, Lidong Zhou

EuroSys (2023)


Abstract

Deep learning training on cloud platforms usually follows the convention of separating storage from compute. Training executes on a compute cluster equipped with GPUs/TPUs while reading data from a separate cluster hosting the storage service. To alleviate the potential IO bottleneck, a training cluster usually uses its local storage as a cache to reduce remote IO from the storage cluster. However, existing deep learning schedulers do not manage storage resources and thus fail to consider the diverse caching effects across different training jobs. This can degrade scheduling quality significantly. To address this issue, we present SiloD, a scheduling framework that co-designs the cluster scheduler and the cache subsystem for deep learning training. SiloD treats cache and remote IO as first-class resources and can integrate different state-of-the-art deep learning scheduling policies in a unified scheduling framework. To achieve this, SiloD develops an enhanced job performance estimator that helps different schedulers jointly consider the impact of storage and compute resource allocation while preserving their respective scheduling objectives. The SiloD-enhanced performance estimator leverages the unique data access pattern of deep learning training to develop a closed-form analytic model that captures the diverse cache and remote IO requirements of different training jobs. Evaluations show that SiloD improves the average job completion time, cluster utilization, and fairness by up to 7.4x, 2.57x, and 1.89x, respectively, compared to different combinations of cache systems and cluster schedulers that operate independently.
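
The abstract does not state the closed-form model itself, so the following is only a minimal sketch of what such an estimator might look like. It is not SiloD's actual formula. It assumes each epoch reads every sample once in random order, so the expected cache hit ratio is roughly the cached fraction of the dataset, and that job throughput is the compute-bound rate capped by how fast cache misses can be served from remote storage. The function names, parameters, and units are hypothetical.

```python
def cache_hit_ratio(cache_gb: float, dataset_gb: float) -> float:
    """Expected hit ratio, assuming uniform once-per-epoch random access."""
    if dataset_gb <= 0:
        raise ValueError("dataset_gb must be positive")
    # A cache larger than the dataset cannot raise the hit ratio above 1.
    return min(1.0, max(0.0, cache_gb) / dataset_gb)


def estimated_throughput(compute_mbps: float, cache_gb: float,
                         dataset_gb: float, remote_io_mbps: float) -> float:
    """Data rate (MB/s) a job can sustain under the assumptions above.

    compute_mbps:   rate the job would consume data if IO were free
    remote_io_mbps: remote IO bandwidth allocated to the job
    """
    miss_ratio = 1.0 - cache_hit_ratio(cache_gb, dataset_gb)
    if miss_ratio == 0.0:
        # Fully cached: remote IO is no longer a constraint.
        return compute_mbps
    # Only misses go to remote storage, so the allocated bandwidth
    # supports an overall rate of remote_io_mbps / miss_ratio.
    return min(compute_mbps, remote_io_mbps / miss_ratio)
```

For example, under these assumptions a job that could consume 100 MB/s, with half of its dataset cached and 20 MB/s of remote IO, is limited to `estimated_throughput(100, 50, 100, 20)`, which evaluates to 40.0 MB/s. A scheduler could compare such estimates across jobs to decide where extra cache or remote IO helps most.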


Keywords

Machine learning systems, cloud computing, cache systems
