Network-accelerated distributed machine learning for multi-tenant settings

SoCC '20: ACM Symposium on Cloud Computing, Virtual Event, USA, October 2020

Abstract
Many distributed machine learning (DML) workloads are increasingly being run in shared clusters. Training in such clusters can be impeded by unexpected compute and network contention, resulting in stragglers. We present MLfabric, a contention-aware DML system that manages the performance of a DML job running in a shared cluster. The DML application hands all network communication (gradient and model transfers) to the MLfabric communication library. MLfabric then carefully orders transfers to improve convergence, opportunistically aggregates them at idle DML workers to improve resource efficiency, and replicates them to support new notions of fault tolerance, while systematically accounting for compute stragglers and network contention. We find that MLfabric achieves up to 3x speed-up in training large deep learning models in realistic dynamic cluster settings.
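To make the aggregation idea concrete, the following is a minimal, hypothetical sketch (not the MLfabric API) of opportunistic gradient aggregation: gradients bound for the parameter server are first summed at an idle helper worker, so the server receives a single combined transfer instead of one per sender. Function and variable names are illustrative only.

```python
# Illustrative sketch, not MLfabric's actual interface: gradients from
# several workers are combined at an idle worker before a single
# transfer reaches the parameter server.
import numpy as np

def aggregate_at_idle_worker(gradients):
    """Sum same-shaped gradient tensors at an intermediate (idle) worker."""
    return np.sum(gradients, axis=0)

def apply_update(model, aggregated_grad, lr=0.01, num_workers=1):
    """Parameter-server step: average the pre-aggregated gradient and
    apply a plain SGD update."""
    return model - lr * (aggregated_grad / num_workers)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    model = rng.standard_normal(4)
    # Gradients computed by three workers in the same iteration.
    grads = [rng.standard_normal(4) for _ in range(3)]
    combined = aggregate_at_idle_worker(grads)  # one transfer to the server
    model = apply_update(model, combined, num_workers=len(grads))
    print(model)
```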