Si-Kintsugi: Towards Recovering Golden-Like Performance of Defective Many-Core Spatial Architectures for AI

Edward Hanson,Shiyu Li, Guanglei Zhou, Feng Cheng, Yitu Wang, Rohan Bose, Hai (Helen) Li,Yiran Chen

56TH IEEE/ACM INTERNATIONAL SYMPOSIUM ON MICROARCHITECTURE, MICRO 2023(2023)

引用 0|浏览1
暂无评分
摘要
The growing demand for higher compute and memory capacity driven by artificial intelligence (AI) applications pushes higher core counts in modern systems. Many-core architectures exhibiting spatial interconnects with high on-chip bandwidth are ideal for these workloads due to their data movement flexibility and sheer parallelism. However, the size of such platforms makes them particularly susceptible to manufacturing defects, prompting a need for designs and mechanisms that improve yield. Despite these techniques, nonfunctional cores and links are unavoidable. Although prior works address defective cores by disabling them and only schedulingworkload to functional ones, communication latency through spatial interconnects is tightly associated with the locations of defective cores and cores with assigned work. Based on this observation, we present Si-Kintsugi, a defect-aware workload scheduling framework for spatial architectures with mesh topology. First, we design a novel and generalizable workload mapping representation and cost function that integrates defect pattern information. The mapping representation is formed into a 1D vector with simple constraints, making it an ideal candidate for open source heuristic-based optimization algorithms. After a communication latency optimized workload mapping is found, dataflow between the mapped cores is automatically generated to balance communication and computation cost. Si-Kintsugi is extensively evaluated on various workloads (i.e., BERT, ResNet, GEMM) across a wide range of defect patterns and rates. Experiment results show that Si-Kintsugi generates a workload schedule that is on average 1.34x faster than the industry standard layer-pipelined schedule on defective platforms.
更多
查看译文
关键词
multi-core architectures,spatial network-on-chip,defective cores,workload scheduling,AI acceleration
AI 理解论文
溯源树
样例
生成溯源树,研究论文发展脉络
Chat Paper
正在生成论文摘要