swFLOW: A large-scale distributed framework for deep learning on Sunway TaihuLight supercomputer

Information Sciences (2021)

Abstract
Deep learning is widely used in many fields, and numerous models and software frameworks have been proposed. However, it remains difficult to run deep learning workloads efficiently on traditional high performance computing (HPC) systems. In this paper, we propose swFLOW, a large-scale distributed framework for deep learning on the Sunway TaihuLight supercomputer. Guided by a performance analysis of convolutional neural networks (CNNs), we optimize the convolutional layer and obtain a 10.42× speedup over the original version. For distributed training, we use the elastic averaging stochastic gradient descent (EASGD) algorithm to reduce communication. On 512 processes, we achieve a parallel efficiency of 81.01% with communication period τ=8. In addition, we present a decentralized implementation of the distributed swFLOW system to alleviate the bottleneck at the central server. With distributed swFLOW, we scale the batch size up to 4096 across 1024 concurrent processes for a cancerous region detection algorithm. This successful application of swFLOW shows the potential of combining deep learning with HPC systems.
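To illustrate how a communication period τ reduces traffic in EASGD, here is a minimal single-process sketch. It is not the swFLOW implementation: the function name `easgd`, the toy objective f(x) = 0.5 (x - 3)^2, the learning rate, the elastic coefficient `alpha`, and the worker count are all assumptions chosen for illustration. Each simulated worker takes local SGD steps, and only every `tau` steps exchanges an elastic update with a shared center variable.

```python
import random


def easgd(num_workers=4, tau=8, steps=400, lr=0.05, alpha=0.1, seed=0):
    """Toy synchronous EASGD on f(x) = 0.5 * (x - 3)^2 with noisy gradients.

    All hyperparameters are illustrative assumptions, not values from the paper
    (except that tau=8 mirrors the communication period quoted in the abstract).
    """
    rng = random.Random(seed)
    target = 3.0                      # minimizer of the toy objective
    center = 0.0                      # shared center variable
    workers = [rng.uniform(-1.0, 1.0) for _ in range(num_workers)]
    comm_rounds = 0

    for t in range(1, steps + 1):
        # Local SGD step on every worker: no communication involved.
        for i in range(num_workers):
            grad = (workers[i] - target) + rng.gauss(0.0, 0.1)
            workers[i] -= lr * grad

        # Every tau steps, workers and center pull toward each other.
        if t % tau == 0:
            diffs = [w - center for w in workers]
            for i in range(num_workers):
                workers[i] -= alpha * diffs[i]   # worker moves toward center
            center += alpha * sum(diffs)         # center moves toward workers
            comm_rounds += 1

    return center, workers, comm_rounds


if __name__ == "__main__":
    center, workers, rounds = easgd()
    print(f"center={center:.3f}, communication rounds={rounds}")
```

In this sketch, 400 local steps with τ=8 trigger only 50 communication rounds, versus 400 if workers synchronized on every step. A decentralized variant, as mentioned in the abstract, would replace the single center variable with a collective exchange among workers; that part is not shown here.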
Keywords
Deep learning, High performance computing, Convolutional neural networks, Cancerous region detection