TANGRAM: Optimized Coarse-Grained Dataflow for Scalable NN Accelerators

Proceedings of the Twenty-Fourth International Conference on Architectural Support for Programming Languages and Operating Systems (2019)

Abstract
The use of increasingly larger and more complex neural networks (NNs) makes it critical to scale the capabilities and efficiency of NN accelerators. Tiled architectures provide an intuitive scaling solution that supports both forms of coarse-grained parallelism in NNs: intra-layer parallelism, where all tiles process a single layer, and inter-layer pipelining, where multiple layers execute across tiles in a pipelined manner. This work proposes dataflow optimizations to address the shortcomings of existing parallel dataflow techniques for tiled NN accelerators. For intra-layer parallelism, we develop a buffer sharing dataflow that turns the distributed buffers into an idealized shared buffer, eliminating excessive data duplication and the associated memory access overheads. For inter-layer pipelining, we develop alternate layer loop ordering, which forwards the intermediate data in a more fine-grained and timely manner, reducing the buffer requirements and pipeline delays. We also make inter-layer pipelining applicable to NNs with complex DAG structures. These optimizations improve the performance of tiled NN accelerators by 2x and reduce their energy consumption by 45% across a wide range of NNs. The effectiveness of our optimizations also increases with NN size and complexity.
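The following is a minimal illustrative sketch (not taken from the paper) of the intuition behind fine-grained inter-layer forwarding: instead of buffering a layer's entire output feature map before the next layer starts, each output tile is consumed by the downstream layer as soon as it is produced. The layer computation `conv_tile`, the tile shapes, and the tile count are hypothetical placeholders.

```python
import numpy as np

def conv_tile(x):
    """Hypothetical per-tile layer computation (stand-in for one conv/FC tile)."""
    return x * 2.0

def layer_by_layer(input_tiles):
    # Conventional schedule: layer 1 produces and buffers ALL of its output
    # tiles before layer 2 starts, so the intermediate buffer must hold the
    # full feature map and layer 2 waits for the entire layer to finish.
    layer1_out = [conv_tile(t) for t in input_tiles]
    layer2_out = [conv_tile(t) for t in layer1_out]
    return layer2_out

def fine_grained_forwarding(input_tiles):
    # Fine-grained forwarding: each layer-1 output tile is handed to layer 2
    # immediately, so only one intermediate tile is live at a time and the
    # downstream layer starts earlier (smaller buffers, shorter pipeline fill).
    layer2_out = []
    for t in input_tiles:
        intermediate = conv_tile(t)                  # single tile of layer-1 output
        layer2_out.append(conv_tile(intermediate))   # consumed right away
    return layer2_out

if __name__ == "__main__":
    tiles = [np.ones((4, 4)) for _ in range(8)]
    # Both schedules compute the same result; they differ only in buffering
    # requirements and in when the downstream layer can begin.
    assert all(np.allclose(a, b) for a, b in
               zip(layer_by_layer(tiles), fine_grained_forwarding(tiles)))
```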
Keywords
dataflow, neural networks, parallelism