Datacenter-Scale Analysis and Optimization of GPU Machine Learning Workloads

Lukasz Wesolowski, Bilge Acun, Valentin Andrei, Adnan Aziz, Gisle Dankel, Christopher Gregg, Xiaoqiao Meng, Cyril Meurillon, Denis Sheahan, Lei Tian, Janet Yang, Peifeng Yu, Kim Hazelwood

IEEE Micro (2021)

Abstract
In this article, we present a system to collectively optimize efficiency in a very large scale deployment of GPU servers for machine learning workloads at Facebook. Our system 1) measures and stores system-wide efficiency metrics for every executed workflow; 2) aggregates data from across the execution stack to identify optimization opportunities that maximize fleet-wide efficiency improvements; 3...
Keywords
Graphics processing units, Measurement, Telemetry, Tools, Social networking (online), Libraries, Training