FlightLLM: Efficient Large Language Model Inference with a Complete Mapping Flow on FPGA
CoRR(2024)
摘要
Transformer-based Large Language Models (LLMs) have made a significant impact
on various domains. However, LLMs' efficiency suffers from both heavy
computation and memory overheads. Compression techniques like sparsification
and quantization are commonly used to mitigate the gap between LLM's
computation/memory overheads and hardware capacity. However, existing GPU and
transformer-based accelerators cannot efficiently process compressed LLMs, due
to the following unresolved challenges: low computational efficiency,
underutilized memory bandwidth, and large compilation overheads.
This paper proposes FlightLLM, enabling efficient LLMs inference with a
complete mapping flow on FPGAs. In FlightLLM, we highlight an innovative
solution that the computation and memory overhead of LLMs can be solved by
utilizing FPGA-specific resources (e.g., DSP48 and heterogeneous memory
hierarchy). We propose a configurable sparse DSP chain to support different
sparsity patterns with high computation efficiency. Second, we propose an
always-on-chip decode scheme to boost memory bandwidth with mixed-precision
support. Finally, to make FlightLLM available for real-world LLMs, we propose a
length adaptive compilation method to reduce the compilation overhead.
Implemented on the Xilinx Alveo U280 FPGA, FlightLLM achieves 6.0×
higher energy efficiency and 1.8× better cost efficiency against
commercial GPUs (e.g., NVIDIA V100S) on modern LLMs (e.g., LLaMA2-7B) using
vLLM and SmoothQuant under the batch size of one. FlightLLM beats NVIDIA A100
GPU with 1.2× higher throughput using the latest Versal VHK158 FPGA.
更多查看译文
AI 理解论文
溯源树
样例
生成溯源树,研究论文发展脉络
Chat Paper
正在生成论文摘要