LISA: Layerwise Importance Sampling for Memory-Efficient Large Language Model Fine-Tuning
arXiv (2024)
Abstract
The machine learning community has witnessed impressive advancements since
the first appearance of large language models (LLMs), yet their huge memory
consumption has become a major roadblock to large-scale training. Parameter
Efficient Fine-Tuning techniques such as Low-Rank Adaptation (LoRA) have been
proposed to alleviate this problem, but their performance still fails to match
full parameter training in most large-scale fine-tuning settings. Attempting to
complement this deficiency, we investigate layerwise properties of LoRA on
fine-tuning tasks and observe an uncommon skewness of weight norms across
different layers. Building on this key observation, we discover a surprisingly
simple training strategy that outperforms both LoRA and full parameter training
in a wide range of settings, with memory costs as low as LoRA's. We name
it Layerwise Importance Sampled AdamW (LISA), a promising alternative to LoRA,
which applies the idea of importance sampling to the layers of an LLM and
randomly freezes most middle layers during optimization. Experimental results
show that with similar or lower GPU memory consumption, LISA surpasses LoRA and
even full parameter tuning on downstream fine-tuning tasks, consistently
outperforming LoRA by 11%-37% in terms of MT-Bench
scores. On large models, specifically LLaMA-2-70B, LISA achieves on-par or
better performance than LoRA on MT-Bench, GSM8K, and PubMedQA, demonstrating
its effectiveness across different domains.
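To make the layer-freezing idea concrete, below is a minimal sketch of how the active layers could be re-sampled during training. It assumes a LLaMA-style module layout (`model.model.embed_tokens`, `model.model.layers`, `model.lm_head`); these attribute names, the function name, and the default of two active layers are illustrative assumptions, not the authors' reference implementation.

```python
import random

def lisa_resample_layers(model, num_active_layers=2):
    """Re-sample which transformer blocks are trainable (LISA-style sketch).

    Assumes a LLaMA-like layout: `model.model.embed_tokens`,
    `model.model.layers` (a list of transformer blocks), and `model.lm_head`.
    Attribute names and defaults are illustrative assumptions.
    """
    blocks = model.model.layers

    # Freeze every middle transformer block first.
    for block in blocks:
        for p in block.parameters():
            p.requires_grad = False

    # Keep the embeddings and the output head trainable at all times.
    for p in model.model.embed_tokens.parameters():
        p.requires_grad = True
    for p in model.lm_head.parameters():
        p.requires_grad = True

    # Unfreeze a small, uniformly sampled subset of middle blocks.
    active = random.sample(range(len(blocks)), k=num_active_layers)
    for idx in active:
        for p in blocks[idx].parameters():
            p.requires_grad = True
    return active
```

In such a scheme, the set of active layers would be re-sampled every fixed number of steps and the optimizer rebuilt over the currently trainable parameters, so AdamW momentum and variance states need only be kept for a few layers plus the embeddings and head at any time; this is consistent with the abstract's claim of memory costs as low as LoRA's.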