LS-HTC: an HTC system for large-scale jobs

Juncheng Hu, Xilong Che, Bowen Kan, Yuhan Shao

CCF Transactions on High Performance Computing (2024)

Abstract
High throughput computing (HTC) uses massive computing resources over long periods of time to complete batches of short, fast jobs. It is widely employed in simulation-heavy fields such as Earth science, materials science, and biomedical science to process large-scale simulation tasks. When the number of jobs reaches millions or tens of millions, scheduling and managing them places a heavy burden on the high performance computing (HPC) cluster. These communities therefore urgently need an HTC system that supports large-scale jobs with little impact on the HPC cluster. To address this problem, we propose LS-HTC, a system that can schedule jobs and computing resources at the million scale. We design the architecture and workflow of LS-HTC and provide a two-level scheduling solution for executing large-scale jobs. A prototype system is implemented and evaluated with more than 20 million jobs on 8,000 compute nodes with 128,000 CPU cores at our HPC cluster. Experimental results indicate that LS-HTC makes the best use of computing resources by dynamically adjusting the number of compute nodes according to the number of jobs, with negligible influence on the shared storage system and management system of the HPC cluster.
Keywords
HTC, Large-scale, Low overhead, Scheduling, HPC cluster