BERTPerf: Inference Latency Predictor for BERT on ARM big.LITTLE Multi-Core Processors

2022 IEEE Workshop on Signal Processing Systems (SiPS)(2022)

Abstract
Hardware-aware Neural Architecture Search (NAS) and mapping & scheduling optimization methods are being used to find efficient implementations of computationally intensive language models such as BERT. This requires measuring real hardware inference latency: good design decisions simply cannot be made with proxy metrics such as FLOPs or the number of parameters. However, the time required to perform on-device latency measurements is prohibitive (e.g., a few days to a few weeks over the course of an optimization run). To address this, we present BERTPerf, a low-cost, highly accurate method to predict the inference time of BERT on ARM big.LITTLE multi-core processors. BERTPerf exploits latency patterns at the layer level to reduce on-device latency measurements, and captures the effect of caching and intermediate tensor allocations to reduce latency prediction error. BERTPerf reduces the maximum prediction error by 7–11% compared to the state-of-the-art, and requires 75% fewer on-device measurements than existing work at the same prediction error.
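The layer-level approach the abstract describes can be sketched as follows: sum measured per-layer latencies, then add a correction term for caching and intermediate tensor-allocation effects. This is a minimal illustrative sketch, not the paper's actual model; all names (`predict_latency`, `cache_correction_ms`) and the numbers in the example are assumptions.

```python
def predict_latency(layer_latencies_ms, cache_correction_ms=0.0):
    """Predict end-to-end inference latency (ms) from per-layer measurements.

    Summing isolated per-layer latencies underestimates the real runtime,
    because caching and intermediate tensor allocations interact across
    layers; a correction term accounts for that (hypothetical sketch of
    the idea, not the paper's fitted model).
    """
    return sum(layer_latencies_ms) + cache_correction_ms

# Illustrative example: 12 identical encoder layers measured at 8.3 ms
# each, plus an assumed 5 ms correction for inter-layer effects.
layers = [8.3] * 12
total = predict_latency(layers, cache_correction_ms=5.0)
```

Measuring a single representative layer and reusing it across identical encoder blocks is what cuts the number of on-device measurements; the correction term is what keeps the summed prediction close to the measured end-to-end latency.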
Keywords
BERTPerf, inference latency predictor, scheduling optimization methods, computationally-intense language models, on-device latency measurements, maximum latency prediction error, hardware inference time, ARM big.LITTLE multicore processors, FLOP