Managing a Heterogeneous Cluster

Proceedings of the Practice and Experience in Advanced Research Computing on Rise of the Machines (learning) (2019)

Abstract
Most HPC clusters are purchased as a large quantity of identical hardware, which is maintained through its lifecycle and then replaced by another HPC cluster. However, some clusters, like ours, are maintained by frequently adding new hardware, which is then integrated into the existing system. Over the years, our cluster has grown to include 300+ compute nodes with 8000+ cores from 6 vendors, spanning 5 generations of CPUs; 7 network technologies from 6 switch vendors (1 Gbps to 100 Gbps, including Ethernet, InfiniBand, and Omni-Path); 102 GPUs (3 different GPU models); 28 storage nodes (3+ PB raw storage); and 7 virtualization nodes hosting 65 VMs. Such a diverse system has significant advantages, although it is more difficult to manage. This paper outlines our strategy for managing this very heterogeneous and complex system. Topics covered include software optimization, consistency of operating system updates, identity management, resource prioritization, network infrastructure, storage, and management of non-compute-intensive resources. Our combination of open source and internally developed software used to manage this cluster is a model for other heterogeneous systems and for smaller clusters that have not expanded because of management worries.
Keywords
heterogeneous clusters, resource management, system administration