Train Small, Infer Large: Memory-Efficient LoRA Training for Large Language Models

Abstract

Large Language Models (LLMs) have significantly advanced natural language processing through their exceptional task generalization capabilities. Low-Rank Adaptation (LoRA) offers a cost-effective fine-tuning solution: it freezes the original model parameters and trains only lightweight, low-rank adapter matrices. However, the memory footprint of LoRA is still largely dominated by the original model parameters. To mitigate this, we propose LoRAM, a memory-efficient LoRA training scheme founded on the intuition that many neurons in over-parameterized LLMs have low training utility but are essential for inference. LoRAM presents a unique twist: it trains on a pruned (small) model to obtain pruned low-rank matrices, which are then recovered and used with the original (large) model for inference. Additionally, minimal-cost continual pre-training, performed in advance by the model publishers, bridges the knowledge discrepancy between the pruned and original models. Our extensive experiments demonstrate the efficacy of LoRAM across various pruning strategies and downstream tasks. For a model with 70 billion parameters, LoRAM enables training on a GPU with only 20 GB of HBM, replacing an A100-80G GPU for LoRA training and 15 GPUs for full fine-tuning. Specifically, QLoRAM, which combines structured pruning with 4-bit quantization, reduces the parameter storage cost that dominates memory usage in low-rank matrix training by 15.81× for LLaMA-3.1-70B (16.95× for LLaMA-2-70B), while achieving dominant performance gains over both the original LLaMA-3.1-70B (LLaMA-2-70B) and the LoRA-trained LLaMA-3.1-8B (LLaMA-2-13B).
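The train-small, infer-large idea can be illustrated with a toy sketch. The abstract does not specify how the pruned low-rank matrices are recovered, so the zero-padding used here is an assumption, and the helper names (`prune`, `recover`, `matmul`) and the tiny matrices are hypothetical stand-ins rather than the authors' implementation. The sketch prunes a weight matrix by rows and columns, pretends that LoRA factors were trained at the pruned size, pads them back to the original dimensions, and adds the resulting update to the original weight.

```python
def matmul(x, y):
    """Plain-Python matrix product of two nested lists."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*y)] for row in x]

def prune(w, keep_out, keep_in):
    """Structured pruning: keep only the listed output rows and input columns."""
    return [[w[i][j] for j in keep_in] for i in keep_out]

def recover(b_small, a_small, keep_out, keep_in, d_out, d_in):
    """Zero-pad pruned low-rank factors to the original size (assumed scheme)."""
    rank = len(a_small)
    b_full = [[0.0] * rank for _ in range(d_out)]
    a_full = [[0.0] * d_in for _ in range(rank)]
    for row, i in enumerate(keep_out):
        b_full[i] = list(b_small[row])
    for r in range(rank):
        for col, j in enumerate(keep_in):
            a_full[r][j] = a_small[r][col]
    return b_full, a_full

# Toy "original" weight: 4 outputs x 3 inputs.
d_out, d_in = 4, 3
w_full = [[1.0, 2.0, 3.0],
          [4.0, 5.0, 6.0],
          [7.0, 8.0, 9.0],
          [10.0, 11.0, 12.0]]
keep_out, keep_in = [0, 2, 3], [0, 2]  # indices that survive pruning

# The 3 x 2 pruned weight is all that training would need to hold in memory.
w_small = prune(w_full, keep_out, keep_in)

# Stand-ins for rank-1 factors that LoRA training on w_small would produce.
b_small = [[1.0], [2.0], [3.0]]  # 3 x 1
a_small = [[0.5, -1.0]]          # 1 x 2

# Recover the factors and apply the update to the original (large) weight.
b_full, a_full = recover(b_small, a_small, keep_out, keep_in, d_out, d_in)
delta = matmul(b_full, a_full)   # 4 x 3 low-rank update
w_infer = [[wv + dv for wv, dv in zip(w_row, d_row)]
           for w_row, d_row in zip(w_full, delta)]
```

Under this assumed scheme, the update restricted to the kept rows and columns equals the product of the small factors, and the pruned rows and columns of the original weight are left unchanged at inference time.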
