Cut Your Losses in Large-Vocabulary Language Models

Abstract

As language models grow ever larger, so do their vocabularies. This has shifted the memory footprint of LLMs during training disproportionately to one single layer: the cross-entropy in the loss computation. Cross-entropy builds up a logit matrix with entries for each pair of input tokens and vocabulary items and, for small models, consumes an order of magnitude more memory than the rest of the LLM combined. We propose Cut Cross-Entropy (CCE), a method that computes the cross-entropy loss without materializing the logits for all tokens into global memory. Rather, CCE only computes the logit for the correct token and evaluates the log-sum-exp over all logits on the fly. We implement a custom kernel that performs the matrix multiplications and the log-sum-exp reduction over the vocabulary in flash memory, making global memory consumption for the cross-entropy computation negligible. This has a dramatic effect. Taking the Gemma 2 (2B) model as an example, CCE reduces the memory footprint of the loss computation from 24 GB to 1 MB, and the total training-time memory consumption of the classifier head from 28 GB to 1 GB. To improve the throughput of CCE, we leverage the inherent sparsity of softmax and propose to skip elements of the gradient computation that have a negligible (i.e., below numerical precision) contribution to the gradient. Experiments demonstrate that the dramatic reduction in memory consumption is accomplished without sacrificing training speed or convergence.
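To make the core idea concrete, here is a minimal NumPy sketch of a cross-entropy computation that never builds the full tokens-by-vocabulary logit matrix. It is not the authors' kernel. It only mirrors the arithmetic the abstract describes: one dot product per token for the correct-token logit, plus a log-sum-exp accumulated over vocabulary chunks. The function names, the `chunk_size` parameter, and the array layout (`embeddings` of shape tokens × dim, `classifier` of shape vocab × dim) are assumptions made for illustration.

```python
import numpy as np


def chunked_cross_entropy(embeddings, classifier, targets, chunk_size=1024):
    """Mean cross-entropy without materializing the (tokens x vocab) logits.

    embeddings: (n_tokens, dim) float array
    classifier: (vocab, dim) float array
    targets:    (n_tokens,) integer array of correct vocabulary indices
    """
    n_tokens = embeddings.shape[0]
    vocab = classifier.shape[0]

    # Logit of the correct token only: one dot product per token.
    correct_logit = np.einsum("nd,nd->n", embeddings, classifier[targets])

    # Streaming log-sum-exp: keep a running max and a running rescaled sum.
    running_max = np.full(n_tokens, -np.inf)
    running_sum = np.zeros(n_tokens)
    for start in range(0, vocab, chunk_size):
        # Only an (n_tokens x chunk) block of logits exists at any time.
        block = embeddings @ classifier[start:start + chunk_size].T
        new_max = np.maximum(running_max, block.max(axis=1))
        running_sum = (
            running_sum * np.exp(running_max - new_max)
            + np.exp(block - new_max[:, None]).sum(axis=1)
        )
        running_max = new_max

    log_sum_exp = running_max + np.log(running_sum)
    return float(np.mean(log_sum_exp - correct_logit))


def naive_cross_entropy(embeddings, classifier, targets):
    """Reference version that builds the full logit matrix, for comparison."""
    logits = embeddings @ classifier.T
    row_max = logits.max(axis=1, keepdims=True)
    log_sum_exp = row_max[:, 0] + np.log(np.exp(logits - row_max).sum(axis=1))
    return float(np.mean(log_sum_exp - logits[np.arange(len(targets)), targets]))
```

In this sketch the peak size of the temporary logit block scales with `chunk_size` rather than with the vocabulary size. The method described in the abstract goes further by performing the reduction inside a custom kernel so that the block never reaches global memory, and by skipping negligible gradient terms, neither of which is modeled here.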
