Accelerating a Triton Fused Kernel for W4A16 Quantized Inference with SplitK work decomposition

CoRR (2024)

Abstract
We propose an implementation of an efficient fused matrix multiplication kernel for W4A16 quantized inference, where we perform dequantization and GEMM in a fused kernel using a SplitK work decomposition. Our implementation shows improvement for the type of skinny matrix-matrix multiplications found in foundation model inference workloads. In particular, this paper surveys the type of matrix multiplication between a skinny activation matrix and a square weight matrix. Our results show an average of 65% speed improvement on the A100 and an average of 124% speed improvement on the H100 over a range of matrix dimensions, including those found in a llama-style model, where m < n = k.
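To make the two ideas in the abstract concrete, here is a minimal NumPy sketch (not the authors' Triton kernel): W4A16 dequantization unpacks int4 weight values and rescales them to floating point, and SplitK partitions the reduction (K) dimension of the GEMM into independent slabs whose partial products are summed at the end. The nibble-packing layout and per-column scales below are illustrative assumptions; real W4A16 libraries use varying layouts, group sizes, and zero-points, and the real fused kernel performs dequantization inside each K-slab and reduces partials with atomic adds on the GPU.

```python
import numpy as np

def dequant_w4(packed, scales):
    """Unpack int4 weights (two nibbles per uint8) and apply per-column
    scales. The interleaved layout here is an illustrative assumption."""
    lo = (packed & 0x0F).astype(np.int8) - 8   # low nibble  -> even rows
    hi = (packed >> 4).astype(np.int8) - 8     # high nibble -> odd rows
    w = np.empty((packed.shape[0] * 2, packed.shape[1]), dtype=np.float32)
    w[0::2], w[1::2] = lo, hi
    return w * scales

def splitk_matmul(a, b, split_k=4):
    """SplitK work decomposition: the K dimension is split into split_k
    independent slabs, each producing a partial m x n product; partials
    are then reduced. This exposes more parallel work for skinny GEMMs
    (small m) that would otherwise underutilize the GPU."""
    m, k = a.shape
    k2, n = b.shape
    assert k == k2 and k % split_k == 0
    step = k // split_k
    partials = [a[:, i * step:(i + 1) * step] @ b[i * step:(i + 1) * step, :]
                for i in range(split_k)]
    return sum(partials)

# Skinny shape typical of decode-time inference: m << n == k.
m, k, n = 4, 64, 64
q = np.random.randint(-8, 8, size=(k, n))                     # int4 range
packed = ((q[0::2] + 8) | ((q[1::2] + 8) << 4)).astype(np.uint8)
scales = np.full(n, 0.01, dtype=np.float32)
b = dequant_w4(packed, scales)                                # round-trip
assert np.allclose(b, q * scales)
a = np.random.rand(m, k).astype(np.float32)                   # fp activations
assert np.allclose(splitk_matmul(a, b, split_k=4), a @ b, atol=1e-4)
```

Because each K-slab's partial GEMM is independent, the slabs can be assigned to separate thread blocks (program instances in Triton), which is where the speedup for m < n = k shapes comes from.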