Performance of Distributed Deep Learning Workloads on a Composable Cyberinfrastructure.

PEARC 2023

Abstract
The next generation of computing systems is likely to rely on disaggregated resources that can be dynamically reconfigured and customized to support scientific and engineering workflows requiring different cyberinfrastructure (CI) technologies. These resources include memory, accelerators, and co-processors, among other technologies. This represents a significant shift in High Performance Computing (HPC) from the now-typical cluster model, in which these resources are permanently attached to a single server. While composable hardware frameworks with disaggregated resources hold promise, we need to understand how to place workflows on these resources and to evaluate the impact of this approach on workflow performance relative to “traditional” clusters. Toward developing this knowledge framework, we study the applicability and performance of deep learning workloads on GPU-enabled composable and traditional HPC computing platforms. We present results from tests performed with the Horovod framework using TensorFlow and PyTorch models on these HPC environments.
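The Horovod framework mentioned above implements data-parallel training: each worker computes gradients on its own shard of a batch, and an allreduce operation averages those gradients so every worker applies the same update. The following is a minimal conceptual sketch of that averaging step in pure Python; it assumes illustrative placeholder gradients and does not use Horovod itself.

```python
# Conceptual sketch of the per-parameter gradient averaging that
# Horovod's allreduce performs in data-parallel training.
# No framework dependency; gradient values are illustrative placeholders.

def allreduce_average(worker_grads):
    """Average each parameter's gradient across all workers."""
    n_workers = len(worker_grads)
    n_params = len(worker_grads[0])
    return [
        sum(grads[i] for grads in worker_grads) / n_workers
        for i in range(n_params)
    ]

# Each worker computes gradients on its own shard of the batch...
worker_grads = [
    [0.25, -0.5, 1.0],   # worker 0
    [0.75, -1.5, 3.0],   # worker 1
]

# ...then every worker applies the same averaged update.
avg = allreduce_average(worker_grads)
print(avg)  # [0.5, -1.0, 2.0]
```

In Horovod the averaging is done with an efficient ring-allreduce over the interconnect, which is why fabric characteristics of composable versus traditional clusters can affect distributed training performance.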
Keywords
distributed deep learning workloads