Train Small, Infer Large: Memory-Efficient LoRA Training for Large Language Models
February 19, 2025
Authors: Jun Zhang, Jue Wang, Huan Li, Lidan Shou, Ke Chen, Yang You, Guiming Xie, Xuejian Gong, Kunlong Zhou
cs.AI
Abstract
Large Language Models (LLMs) have significantly advanced natural language
processing with exceptional task generalization capabilities. Low-Rank Adaptation
(LoRA) offers a cost-effective fine-tuning solution, freezing the original
model parameters and training only lightweight, low-rank adapter matrices.
However, the memory footprint of LoRA is largely dominated by the original
model parameters. To mitigate this, we propose LoRAM, a memory-efficient LoRA
training scheme founded on the intuition that many neurons in
over-parameterized LLMs have low training utility but are essential for
inference. LoRAM presents a unique twist: it trains on a pruned (small) model
to obtain pruned low-rank matrices, which are then recovered and utilized with
the original (large) model for inference. Additionally, minimal-cost continual
pre-training, performed by the model publishers in advance, aligns the
knowledge discrepancy between pruned and original models. Our extensive
experiments demonstrate the efficacy of LoRAM across various pruning strategies
and downstream tasks. For a model with 70 billion parameters, LoRAM enables
training on a GPU with only 20G HBM, replacing an A100-80G GPU for LoRA
training and 15 GPUs for full fine-tuning. Specifically, QLoRAM, which combines
structured pruning with 4-bit quantization, reduces the parameter storage cost
that dominates the memory usage in low-rank matrix training for LLaMA-3.1-70B
(LLaMA-2-70B) by 15.81× (16.95×), while significantly outperforming both the
original LLaMA-3.1-70B (LLaMA-2-70B) and the LoRA-trained LLaMA-3.1-8B
(LLaMA-2-13B).
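To make the train-small, infer-large idea concrete, the following PyTorch sketch shows one plausible reading of the scheme described above. It is illustrative only, not the paper's implementation: it assumes structured pruning that keeps a subset of output rows of a frozen weight matrix and a zero-filled recovery of the pruned low-rank factor, and the names train_lora_on_pruned, recover_lora_B, and kept_rows are hypothetical.

```python
import torch

# Minimal sketch (one plausible reading of LoRAM, not the paper's implementation):
# a frozen linear weight W of shape (d_out, d_in) is structurally pruned to
# W_p of shape (len(kept_rows), d_in) by keeping the rows in `kept_rows`.
# LoRA is then trained on the pruned model, giving a pruned low-rank update
# B_p @ A with B_p of shape (len(kept_rows), r) and A of shape (r, d_in).

def train_lora_on_pruned(W_p, r):
    """Hypothetical placeholder for LoRA fine-tuning on the pruned (small) model."""
    d_out_p, d_in = W_p.shape
    A = 0.01 * torch.randn(r, d_in)   # trainable low-rank factor A
    B_p = torch.zeros(d_out_p, r)     # trainable low-rank factor B (pruned rows only)
    # ... fine-tune (A, B_p) here with W_p and the rest of the pruned model frozen ...
    return B_p, A

def recover_lora_B(B_p, kept_rows, d_out):
    """Recover the pruned low-rank factor to the original output dimension by
    scattering its rows back to their original indices (zeros elsewhere)."""
    B = torch.zeros(d_out, B_p.shape[1])
    B[kept_rows] = B_p
    return B

d_out, d_in, r = 8, 4, 2
W_full = torch.randn(d_out, d_in)            # original (large) model weight
kept_rows = torch.tensor([0, 2, 3, 5, 7])    # rows kept by structured pruning
W_pruned = W_full[kept_rows]                 # pruned (small) model weight

B_p, A = train_lora_on_pruned(W_pruned, r)   # train small
B = recover_lora_B(B_p, kept_rows, d_out)    # recover to the original shape
W_inference = W_full + B @ A                 # infer large: merged low-rank update
print(W_inference.shape)                     # torch.Size([8, 4])
```

Under this reading, training only ever touches the pruned (small) weights, while recovery lets the learned low-rank update be applied to the original (large) weights at inference time.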
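As a rough, hedged sanity check of the quoted memory figures (the paper's exact accounting may differ, and QLoRAM additionally applies 4-bit quantization), assume the frozen 70B base weights would otherwise be stored in 16-bit precision:

```python
params = 70e9                      # LLaMA-3.1-70B parameter count
baseline_gb = 2 * params / 1e9     # ~140 GB for the frozen base weights in FP16
reduced_gb = baseline_gb / 15.81   # apply the quoted QLoRAM reduction factor
print(f"baseline ~ {baseline_gb:.0f} GB, reduced ~ {reduced_gb:.1f} GB")
# baseline ~ 140 GB, reduced ~ 8.9 GB: consistent with low-rank training
# fitting on a GPU with only 20G HBM, with headroom for the low-rank
# matrices, activations, and optimizer state.
```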