Architecture supported register stash for GPGPU

Journal of Parallel and Distributed Computing (2016)

Abstract
A GPGPU provides abundant hardware resources to support a large number of lightweight threads, which are organized into blocks and executed in warps. All threads of a block must be dispatched together to a single streaming multiprocessor (SM). When the remaining resources of an SM cannot accommodate one more block, all threads of that block are held back until earlier blocks retire from the SM. We found that the register file tends to be the most constrained of these resources, especially on SMs with fewer registers. We also observed that a thread's register requirement is dynamic: only a subset of its pre-allocated registers is live at any given instruction at run time. This results in considerable register underutilization.

We propose the architecture supported register stash (ASRS), which removes the register limit when dispatching blocks. Hardware registers are allocated at run time according to each instruction's live registers, which a compiler can determine statically. When the hardware registers cannot meet the requirements of all running warps, some warps are suspended, their registers are temporarily reclaimed, and the data in those registers are stashed to memory. Conversely, when spare hardware registers are available, the SM starts a new warp or resumes a suspended warp once all of that warp's stashed register data have been loaded back from memory. Intra-block synchronization is also handled for the case in which some warps of a block are not schedulable because of the ASRS.

The ASRS alleviates register underutilization and improves performance without modifying the current programming model or demanding extra effort from programmers. It also allows an SM whose registers cannot hold even a single block to execute that block. In addition, it helps lower register file energy consumption and increases power efficiency.
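The allocation policy described above can be illustrated with a toy model. The following Python sketch is not the authors' hardware design; it is a minimal simulation under stated assumptions: registers are counted per warp rather than per thread, every instruction takes one step, stash and reload latency is ignored, warps are served in a fixed priority order, and intra-block synchronization is not modeled. The names `Warp`, `simulate_asrs`, and `static_cycles` are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Warp:
    wid: int
    live_regs: list      # assumed compiler output: live-register count per instruction
    pc: int = 0          # index of the next instruction to issue
    stashed: bool = False

    def done(self):
        return self.pc >= len(self.live_regs)


def simulate_asrs(warps, capacity):
    """Toy run-time allocation: return (steps, stash_events)."""
    steps = 0
    stash_events = 0
    while any(not w.done() for w in warps):
        free = capacity          # registers are re-granted every step
        progressed = False
        for w in warps:          # fixed priority order (an assumption)
            if w.done():
                continue
            need = w.live_regs[w.pc]
            if need <= free:
                free -= need
                w.stashed = False    # warp is started or resumed
                w.pc += 1
                progressed = True
            elif not w.stashed:
                w.stashed = True     # suspend and stash its registers
                stash_events += 1
        if not progressed:
            raise ValueError("one instruction needs more registers than the SM has")
        steps += 1
    return steps, stash_events


def static_cycles(warps, capacity):
    """Baseline: each warp reserves its peak register count up front."""
    per_warp = max(max(w.live_regs) for w in warps)
    concurrent = capacity // per_warp
    if concurrent == 0:
        raise ValueError("register file cannot hold a single warp")
    total = 0
    for i in range(0, len(warps), concurrent):
        batch = warps[i:i + concurrent]
        total += max(len(w.live_regs) for w in batch)
    return total
```

In this toy model, two warps with live-register counts `[4, 2, 4]` and a capacity of 6 registers finish in 4 steps with one stash event, whereas the static baseline needs 6 steps because only one warp fits at a time. These numbers describe the sketch only and say nothing about the paper's measured results.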
The ASRS achieves speedups of 1.59 and 1.14 when the registers of each SM are limited to 8K and 16K, respectively, with insignificant overhead. Relative to an unlimited register file, the speedups are 0.84 and 0.98 with 8K and 16K registers, respectively. Compared with the baseline 32K register file, the ASRS reduces the energy consumption of the 8K and 16K register files to 66.5% and 75.8%, respectively. The corresponding power efficiencies (the ratio of performance to power) increase to 1.29x and 1.31x, respectively.

The register requirement of a GPGPU varies among kernels and during run time. Register file (RF) capacity limits the number of schedulable warps and therefore performance. Reducing RF capacity lowers energy consumption and area. We propose a method that supports more warps with limited registers. It gains a significant speedup and higher energy efficiency with a smaller RF.
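To see why RF capacity limits the number of schedulable warps under conventional static allocation, the following sketch computes how many blocks fit in a register file when every thread reserves its full register count up front. It assumes that "8K", "16K", and "32K" mean 8192, 16384, and 32768 registers, and the kernel shapes (registers per thread, threads per block) are hypothetical values chosen for illustration, not taken from the paper. Other limits such as shared memory or maximum resident threads are ignored.

```python
def max_resident_blocks(rf_registers, regs_per_thread, threads_per_block):
    """Blocks that fit in the register file under static allocation."""
    regs_per_block = regs_per_thread * threads_per_block
    return rf_registers // regs_per_block


# Hypothetical kernel A: 32 registers per thread, 256 threads per block.
for rf in (8192, 16384, 32768):
    print(rf, max_resident_blocks(rf, 32, 256))

# Hypothetical kernel B: 40 registers per thread, 256 threads per block.
# With 8192 registers not even one block fits, the case ASRS is meant to handle.
print(max_resident_blocks(8192, 40, 256))
```

Halving the register file halves the resident blocks for kernel A, and kernel B cannot be dispatched at all with the smallest register file unless registers are allocated at run time.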
Keywords
Register file, GPGPU, Performance, Energy