Demystifying BERT: System Design Implications

2022 IEEE International Symposium on Workload Characterization (IISWC)

Abstract
Transfer learning in natural language processing (NLP) uses increasingly large models that tackle challenging problems. Consequently, these applications are driving the requirements of future systems. To this end, we study the computationally and time-intensive training phase of NLP models and identify how its algorithmic behavior can guide future accelerator design. We focus on BERT (Bidirectional Encoder Representations from Transformers), one of the most popular Transformer-based NLP models, and identify key operations that merit attention in accelerator design. In particular, we focus on the manifestation, size, and arithmetic behavior of these operations, which remain constant irrespective of hardware choice. Our results show that although computations which manifest as matrix multiplications dominate BERT’s execution, they have considerable heterogeneity. Furthermore, we characterize memory-intensive computations that also feature prominently in BERT but have received less attention. To capture future Transformer trends, we also show and discuss the implications of these behaviors as networks grow larger. Moreover, we study the impact of key training techniques such as distributed training, checkpointing, and mixed-precision training. Finally, our analysis identifies holistic solutions to optimize systems for BERT-like models, and we further demonstrate how enhancing compute-intensive accelerators with near-memory compute can help accelerate Transformer networks.
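
The operation mix described in the abstract maps directly onto the structure of a Transformer encoder layer: the Q/K/V projections, the attention score and context matrix multiplications, and the feed-forward network all manifest as matrix multiplications of different shapes, while softmax, LayerNorm, GELU, and residual additions are the memory-intensive computations. The PyTorch sketch below is a minimal illustration of that layout for a BERT-Base-sized layer (hidden size 768, 12 heads, FFN size 3072); it is not the paper's measurement harness, and the configuration is an assumption chosen only to show where each kind of operation appears.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class EncoderLayer(nn.Module):
    """One BERT-Base-sized encoder layer (hidden=768, heads=12, ffn=3072)."""
    def __init__(self, hidden=768, heads=12, ffn=3072):
        super().__init__()
        self.heads, self.d_head = heads, hidden // heads
        # Compute-intensive: these all manifest as matrix multiplications.
        self.qkv = nn.Linear(hidden, 3 * hidden)   # fused Q/K/V projection
        self.proj = nn.Linear(hidden, hidden)      # attention output projection
        self.ffn_in = nn.Linear(hidden, ffn)       # FFN expansion
        self.ffn_out = nn.Linear(ffn, hidden)      # FFN contraction
        # Memory-intensive: low arithmetic intensity, bandwidth-bound.
        self.ln1 = nn.LayerNorm(hidden)
        self.ln2 = nn.LayerNorm(hidden)

    def forward(self, x):                          # x: [batch, seq, hidden]
        b, s, h = x.shape
        q, k, v = self.qkv(x).chunk(3, dim=-1)     # matmul
        q = q.view(b, s, self.heads, self.d_head).transpose(1, 2)
        k = k.view(b, s, self.heads, self.d_head).transpose(1, 2)
        v = v.view(b, s, self.heads, self.d_head).transpose(1, 2)
        scores = q @ k.transpose(-2, -1) / self.d_head ** 0.5  # batched matmul
        attn = F.softmax(scores, dim=-1)           # memory-bound
        ctx = (attn @ v).transpose(1, 2).reshape(b, s, h)      # batched matmul
        x = self.ln1(x + self.proj(ctx))           # matmul + memory-bound add/LayerNorm
        y = self.ffn_out(F.gelu(self.ffn_in(x)))   # matmuls + memory-bound GELU
        return self.ln2(x + y)                     # memory-bound

x = torch.randn(8, 128, 768)
print(EncoderLayer()(x).shape)                     # torch.Size([8, 128, 768])
```

A breakdown like this is one way to see the heterogeneity the abstract refers to: the attention matmuls scale with the square of the sequence length, while the projection and FFN matmuls scale with the hidden and FFN widths.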
Keywords
Deep Learning, Transformers, Characterization, Accelerator Design, Near-Memory Computing
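
The abstract also calls out distributed training, checkpointing, and mixed-precision training as key training techniques. As a hedged sketch of how the latter two are commonly expressed in PyTorch (standard torch.cuda.amp and torch.utils.checkpoint usage on a recent PyTorch release, not the paper's own training setup; the layer stack, sizes, and loss are placeholder assumptions), the following combines loss-scaled mixed precision with activation checkpointing:

```python
import torch
import torch.nn as nn
from torch.utils.checkpoint import checkpoint

# A small stack of Transformer encoder layers standing in for BERT.
layers = nn.ModuleList(
    nn.TransformerEncoderLayer(d_model=768, nhead=12, dim_feedforward=3072,
                               batch_first=True)
    for _ in range(4)
)
opt = torch.optim.AdamW(layers.parameters(), lr=1e-4)
use_cuda = torch.cuda.is_available()
scaler = torch.cuda.amp.GradScaler(enabled=use_cuda)

x = torch.randn(8, 128, 768, device="cuda" if use_cuda else "cpu")
if use_cuda:
    layers.cuda()

opt.zero_grad()
# Mixed precision: matmul-heavy ops run in half precision, reductions in FP32.
with torch.cuda.amp.autocast(enabled=use_cuda):
    h = x
    for layer in layers:
        # Activation checkpointing: recompute activations in the backward
        # pass, trading extra FLOPs for a smaller memory footprint.
        h = checkpoint(layer, h, use_reentrant=False)
    loss = h.float().pow(2).mean()  # placeholder loss for illustration
# Loss scaling keeps small half-precision gradients from underflowing.
scaler.scale(loss).backward()
scaler.step(opt)
scaler.update()
```

Checkpointing matters here because the memory-intensive operations identified above produce large activation tensors, and mixed precision shrinks both the matmul and activation footprints.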