Faster Neighborhood Attention: Reducing the O(n^2) Cost of Self Attention at the Threadblock Level
arxiv(2024)
摘要
Neighborhood attention reduces the cost of self attention by restricting each
token's attention span to its nearest neighbors. This restriction,
parameterized by a window size and dilation factor, draws a spectrum of
possible attention patterns between linear projection and self attention.
Neighborhood attention, and more generally sliding window attention patterns,
have long been bounded by infrastructure, particularly in higher-rank spaces
(2-D and 3-D), calling for the development of custom kernels, which have been
limited in either functionality, or performance, if not both. In this work, we
first show that neighborhood attention can be represented as a batched GEMM
problem, similar to standard attention, and implement it for 1-D and 2-D
neighborhood attention. These kernels on average provide 895
improvement in full precision latency compared to existing naive kernels for
1-D and 2-D neighborhood attention respectively. We find certain inherent
inefficiencies in all unfused neighborhood attention kernels that bound their
performance and lower-precision scalability. We also developed fused
neighborhood attention; an adaptation of fused dot-product attention kernels
that allow fine-grained control over attention across different spatial axes.
Known for reducing the quadratic time complexity of self attention to a linear
complexity, neighborhood attention can now enjoy a reduced and constant memory
footprint, and record-breaking half precision latency. We observe that our
fused kernels successfully circumvent some of the unavoidable inefficiencies in
unfused implementations. While our unfused GEMM-based kernels only improve half
precision performance compared to naive kernels by an average of 496
in 1-D and 2-D problems respectively, our fused kernels improve naive kernels
by an average of 1607
更多查看译文
AI 理解论文
溯源树
样例
生成溯源树,研究论文发展脉络
Chat Paper
正在生成论文摘要