Cut Your Losses in Large-Vocabulary Language Models

Abstract

As language models grow ever larger, so do their vocabularies. This has shifted the memory footprint of LLMs during training disproportionately to one single layer: the cross-entropy in the loss computation. Cross-entropy builds up a logit matrix with entries for each pair of input tokens and vocabulary items and, for small models, consumes an order of magnitude more memory than the rest of the LLM combined. We propose Cut Cross-Entropy (CCE), a method that computes the cross-entropy loss without materializing the logits for all tokens into global memory. Rather, CCE only computes the logit for the correct token and evaluates the log-sum-exp over all logits on the fly. We implement a custom kernel that performs the matrix multiplications and the log-sum-exp reduction over the vocabulary in flash memory, making global memory consumption for the cross-entropy computation negligible. This has a dramatic effect. Taking the Gemma 2 (2B) model as an example, CCE reduces the memory footprint of the loss computation from 24 GB to 1 MB, and the total training-time memory consumption of the classifier head from 28 GB to 1 GB. To improve the throughput of CCE, we leverage the inherent sparsity of softmax and propose to skip elements of the gradient computation that have a negligible (i.e., below numerical precision) contribution to the gradient. Experiments demonstrate that the dramatic reduction in memory consumption is accomplished without sacrificing training speed or convergence.
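The central idea of the abstract, computing only the correct-token logit and accumulating the log-sum-exp over the vocabulary on the fly, can be illustrated with a small NumPy sketch. This is not the authors' kernel: it runs on the CPU, covers only the forward pass, and omits the gradient filtering. The names E (token embeddings), C (classifier matrix), targets, block, and both function names are assumptions chosen for illustration.

```python
import numpy as np


def naive_cross_entropy(E, C, targets):
    """Reference version: materializes the full (tokens x vocab) logit matrix."""
    logits = E @ C.T                                   # shape (N, V)
    m = logits.max(axis=1, keepdims=True)
    lse = m[:, 0] + np.log(np.exp(logits - m).sum(axis=1))
    correct = logits[np.arange(len(targets)), targets]
    return float(np.mean(lse - correct))


def chunked_cross_entropy(E, C, targets, block=256):
    """Same loss, but only one (N x block) slice of logits exists at a time."""
    n = E.shape[0]
    # Logit of the correct token: one dot product per token, no full matrix.
    correct = np.einsum("nd,nd->n", E, C[targets])

    # Running (max, sum) pair for a numerically stable streaming log-sum-exp.
    running_max = np.full(n, -np.inf)
    running_sum = np.zeros(n)
    for start in range(0, C.shape[0], block):
        blk = E @ C[start:start + block].T             # shape (N, <= block)
        new_max = np.maximum(running_max, blk.max(axis=1))
        # Rescale the old sum to the new max, then add this block's terms.
        running_sum = (running_sum * np.exp(running_max - new_max)
                       + np.exp(blk - new_max[:, None]).sum(axis=1))
        running_max = new_max

    lse = running_max + np.log(running_sum)
    return float(np.mean(lse - correct))


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    E = rng.normal(size=(32, 16))        # 32 tokens, hidden size 16
    C = rng.normal(size=(1000, 16))      # vocabulary of 1000 entries
    targets = rng.integers(0, 1000, size=32)
    print(naive_cross_entropy(E, C, targets))
    print(chunked_cross_entropy(E, C, targets))
```

In this sketch the peak size of the temporary logit buffer is N x block rather than N x V, which is the memory saving the abstract describes; the custom kernel in the paper goes further by keeping those blocks out of global memory altogether.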
