Training-Free Long-Context Scaling of Large Language Models

ICML 2024

Abstract
The ability of Large Language Models (LLMs) to process and generate coherent text weakens markedly when the number of input tokens exceeds their pretraining length. Because finetuning large-scale models on longer sequences is expensive, we propose a training-free approach named Dual Chunk Attention (DCA), which enables Llama2 70B to support context windows of up to 100k tokens. By decomposing the attention computation for long sequences into chunk-based modules, DCA effectively captures the relative positional information of tokens within the same chunk (Intra-Chunk) and across distinct chunks (Inter-Chunk), and it integrates seamlessly with Flash Attention. In addition to its impressive extrapolation capability, DCA achieves performance on practical long-context tasks that is comparable to or even better than that of models built through continual training. All code and data used in this work are released at https://github.com/HKUNLP/ChunkLlama.
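To make the chunk decomposition concrete, the following is a minimal sketch of how relative positions could be assigned so that they never exceed the pretraining length. It is an illustration under stated assumptions, not the released implementation: the function name `chunked_relative_positions`, its parameters, and the choice to pin inter-chunk queries to the largest pretrained index are assumptions made here, and the full method may include refinements that the abstract does not describe.

```python
def chunked_relative_positions(seq_len, chunk_size, max_pretrain_len):
    """Build a causal matrix of relative positions for chunked attention.

    Hypothetical helper for illustration only. Entry [i][j] is the relative
    position used when query i attends to key j (None where j > i).
    """
    if chunk_size >= max_pretrain_len:
        raise ValueError("chunk_size must be smaller than max_pretrain_len")

    # Keys reuse positions 0..chunk_size-1 inside every chunk.
    key_pos = [j % chunk_size for j in range(seq_len)]

    rel = [[None] * seq_len for _ in range(seq_len)]
    for i in range(seq_len):
        for j in range(i + 1):
            same_chunk = (i // chunk_size) == (j // chunk_size)
            if same_chunk:
                # Intra-chunk: ordinary relative distance within the chunk.
                rel[i][j] = key_pos[i] - key_pos[j]
            else:
                # Inter-chunk (assumption): the query is pinned to the largest
                # index seen in pretraining, so the distance stays in range.
                rel[i][j] = (max_pretrain_len - 1) - key_pos[j]
    return rel


if __name__ == "__main__":
    # 12 tokens, chunks of 4, and a model pretrained on 8 positions.
    for row in chunked_relative_positions(12, 4, 8):
        print(["." if v is None else v for v in row])
```

In this sketch every relative position stays below `max_pretrain_len` regardless of `seq_len`, which is the property that lets a model operate beyond its pretraining length without further training.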