PUMA: Efficient and Low-Cost Memory Allocation and Alignment Support for Processing-Using-Memory Architectures
arXiv (2024)
Abstract
Processing-using-DRAM (PUD) architectures impose a restrictive data layout
and alignment for their operands, where source and destination operands (i)
must reside in the same DRAM subarray (i.e., a group of DRAM rows sharing the
same row buffer and row decoder) and (ii) are aligned to the boundaries of a
DRAM row. However, standard memory allocation routines (i.e., malloc,
posix_memalign, and huge pages-based memory allocation) fail to meet the data
layout and alignment requirements for PUD architectures to operate
successfully. To allow the memory allocation API to influence the OS memory
allocator and ensure that memory objects are placed within specific DRAM
subarrays, we propose a new lazy data allocation routine (in the kernel) for
PUD memory objects called PUMA. The key idea of PUMA is to use the internal
DRAM mapping information together with huge pages and then split huge pages
into finer-grained allocation units that are (i) aligned to the page address
and size and (ii) virtually contiguous.
We implement PUMA as a kernel module using QEMU and emulate a RISC-V machine
running Fedora 33 with v5.9.0 Linux Kernel. We emulate the implementation of a
PUD system capable of executing row copy operations (as in RowClone) and
Boolean AND/OR/NOT operations (as in Ambit). In our experiments, an operation
is performed on the host CPU whenever it cannot be executed in our PUD
substrate (due to data misalignment). PUMA significantly outperforms
the baseline memory allocators for all evaluated microbenchmarks and allocation
sizes.