Datacenter-Scale Analysis and Optimization of GPU Machine Learning Workloads
IEEE Micro (2021)
Abstract
In this article, we present a system to collectively optimize efficiency in a very large scale deployment of GPU servers for machine learning workloads at Facebook. Our system 1) measures and stores system-wide efficiency metrics for every executed workflow; 2) aggregates data from across the execution stack to identify optimization opportunities that maximize fleet-wide efficiency improvements; 3...
Keywords
Graphics processing units, Measurement, Telemetry, Tools, Social networking (online), Libraries, Training