Pulley: An Algorithm/Hardware Co-Optimization for In-Memory Sorting

IEEE Computer Architecture Letters(2022)

引用 0|浏览8
暂无评分
摘要
Sorting is an important kernel that requires many passes on data, where each pass imposes significant data movement overhead. Processing in memory (PIM) can reduce this data movement overhead while providing high parallelism. The radix sorting algorithm is scalable and can exploit PIM's parallelism. However, this algorithm is inefficient for current PIM-based accelerators for three reasons: (i) requiring a large intermediate array per processing unit, wasting capacity, (ii) requiring a prefix-sum operation across all the large intermediate arrays, imposing performance overhead, and (iii) requiring significant random accesses, which are costly in PIM. In this paper, we propose an algorithm and hardware co-optimization for sorting that enable every group of processing elements to cooperatively share and generate an intermediate array, reducing the capacity overhead of intermediate arrays and performance overhead of the prefix-sum operation. To prevent the shared array from becoming a bottleneck due to random accesses, we eliminate random accesses by adding a local sorting step to the radix sorting and providing efficient hardware support for this step. On average, our hardware/algorithm optimizations, Pulley, deliver 20× speedup compared to Bonsai, an FPGA-based sorting accelerator, and 13× speedup compared to IMC, an in-logic-layer-based sorting accelerator.
更多
查看译文
关键词
Processing in memory,PIM,sort,radix sorting
AI 理解论文
溯源树
样例
生成溯源树,研究论文发展脉络
Chat Paper
正在生成论文摘要