Cut Your Losses in Large-Vocabulary Language Models

November 13, 2024
Authors: Erik Wijmans, Brody Huval, Alexander Hertzberg, Vladlen Koltun, Philipp Krähenbühl
cs.AI

Abstract

As language models grow ever larger, so do their vocabularies. This has shifted the memory footprint of LLMs during training disproportionately to one single layer: the cross-entropy in the loss computation. Cross-entropy builds up a logit matrix with entries for each pair of input tokens and vocabulary items and, for small models, consumes an order of magnitude more memory than the rest of the LLM combined. We propose Cut Cross-Entropy (CCE), a method that computes the cross-entropy loss without materializing the logits for all tokens into global memory. Rather, CCE only computes the logit for the correct token and evaluates the log-sum-exp over all logits on the fly. We implement a custom kernel that performs the matrix multiplications and the log-sum-exp reduction over the vocabulary in flash memory, making global memory consumption for the cross-entropy computation negligible. This has a dramatic effect. Taking the Gemma 2 (2B) model as an example, CCE reduces the memory footprint of the loss computation from 24 GB to 1 MB, and the total training-time memory consumption of the classifier head from 28 GB to 1 GB. To improve the throughput of CCE, we leverage the inherent sparsity of softmax and propose to skip elements of the gradient computation that have a negligible (i.e., below numerical precision) contribution to the gradient. Experiments demonstrate that the dramatic reduction in memory consumption is accomplished without sacrificing training speed or convergence.
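The decomposition the abstract relies on is that the per-token cross-entropy loss equals logsumexp over all logits minus the logit of the correct token, so the full (tokens x vocabulary) logit matrix never needs to exist at once. Below is a minimal PyTorch sketch of that idea for the forward pass only; it is not the authors' fused kernel, and the names (cce_loss_reference, chunk_size) are illustrative.

import torch

def cce_loss_reference(hidden, classifier, targets, chunk_size=4096):
    # hidden:     (N, D) token embeddings entering the classifier head
    # classifier: (V, D) weight matrix of the vocabulary head
    # targets:    (N,)   index of the correct token at each position
    # Per-token loss = logsumexp_v(x_i . c_v) - x_i . c_{y_i}, so we only need
    # the correct-token logits plus a running log-sum-exp over vocabulary chunks.
    correct_logit = (hidden * classifier[targets]).sum(dim=-1)          # (N,)
    lse = torch.full_like(correct_logit, float("-inf"))
    for start in range(0, classifier.shape[0], chunk_size):
        block = classifier[start:start + chunk_size]                    # (B, D)
        block_logits = hidden @ block.T                                 # (N, B), freed each iteration
        lse = torch.logaddexp(lse, torch.logsumexp(block_logits, dim=-1))
    return (lse - correct_logit).mean()

In this eager sketch the peak extra memory is one N-by-chunk_size block of logits rather than the full N-by-V matrix; the paper's CCE kernel goes further by performing the matrix multiplications and the log-sum-exp reduction on-chip, so essentially nothing is written to global memory, and by skipping gradient terms whose softmax probability falls below numerical precision.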

