Machine learning at the limit

Big Data (2015)

Abstract
Many systems have been developed for machine learning at scale. Performance has steadily improved, but there has been relatively little work on explicitly defining or approaching the limits of performance. In this paper we describe the application of roofline design, an approach borrowed from computer architecture, to large-scale machine learning. In roofline design, one exposes the ALU, memory, and network limits of a platform and the constraints they imply for algorithms. Using roofline design, we have developed a system called BIDMach which has demonstrated the highest performance to date for many ML problems. On one GPU-accelerated node, it generally outperforms other single-machine toolkits, as well as cluster toolkits running on hundreds of nodes. This performance level is enabled by a relatively small number of rooflined matrix primitives, and it implies a dramatic reduction in the energy used to perform these calculations. Beyond matrix kernels, roofline design can be applied to the end-to-end design of machine learning algorithms that minimize memory usage to optimize speed; this approach offers a further 2x to 3x gain in performance. Roofline design can also be applied to network primitives. We describe recent work on a sparse allreduce primitive called Kylix, and show that Kylix approaches the practical network throughput limit for allreduce, a basic primitive for distributed machine learning. Using Kylix, we describe an efficient transformation from model-parallel to data-parallel calculations. This transformation uses a secondary-storage roofline with parameters similar to the network's. Finally, we describe several deployments of these techniques on real-world problems at two large internet companies. Once again, single-node rooflined design demonstrated substantial gains over alternatives running on either single nodes or clusters.
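The roofline bound at the heart of the abstract can be stated in one line: a kernel's attainable throughput is min(peak ALU rate, memory bandwidth × arithmetic intensity). The sketch below is a minimal illustration of that bound, not code from BIDMach; the hardware constants and function names are hypothetical.

```python
# Minimal roofline-model sketch. PEAK_FLOPS and MEM_BW are hypothetical
# hardware constants, not measurements from the paper.
PEAK_FLOPS = 4.0e12   # peak ALU throughput, FLOP/s
MEM_BW = 3.0e11       # main-memory bandwidth, bytes/s

def attainable_flops(intensity: float) -> float:
    """Roofline bound for a kernel whose arithmetic intensity is
    `intensity` FLOPs per byte moved to or from memory."""
    return min(PEAK_FLOPS, MEM_BW * intensity)

# A sparse matrix-vector product does roughly 2 FLOPs per 8-byte value
# read, so it is memory-bound and sits far below the compute roof:
print(attainable_flops(0.25))   # ~7.5e10 FLOP/s (bandwidth-limited)
# A dense matrix multiply with heavy data reuse reaches the compute roof:
print(attainable_flops(100.0))  # 4.0e12 FLOP/s (ALU-limited)
```

The same min-of-roofs reasoning extends to the network roofline: an allreduce cannot move data faster than the practical network throughput. The abstract describes Kylix as a sparse allreduce that approaches this limit; the simulation below shows only the generic butterfly-allreduce pattern that a primitive like Kylix generalizes (to heterogeneous degrees and sparse data), with made-up names throughout.

```python
def butterfly_allreduce(values):
    """Simulate a butterfly allreduce over n = 2^k nodes: in the round
    with stride s, node i exchanges partial sums with node i XOR s, so
    after log2(n) rounds every node holds the global sum."""
    n = len(values)
    assert n and n & (n - 1) == 0, "sketch assumes a power-of-two node count"
    vals = list(values)
    stride = 1
    while stride < n:
        vals = [vals[i] + vals[i ^ stride] for i in range(n)]
        stride <<= 1
    return vals

print(butterfly_allreduce([1, 2, 3, 4]))  # [10, 10, 10, 10]
```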
Keywords
Scalable Machine Learning, Big Data, Distributed Systems