NeuZip: Memory-Efficient Training and Inference with Dynamic Compression of Neural Networks

Abstract

The performance of neural networks improves as more parameters are used. However, model sizes are constrained by the available on-device memory during training and inference. Although techniques such as quantization can alleviate this constraint, they suffer from performance degradation. In this work, we introduce NeuZip, a new weight compression scheme based on the entropy of floating-point numbers in neural networks. With NeuZip, we achieve memory-efficient training and inference without sacrificing performance. Notably, we reduce the memory footprint of training a Llama-3 8B model from 31 GB to less than 16 GB while keeping the training dynamics fully unchanged. In inference, our method can reduce memory usage by more than half while maintaining near-lossless performance. Our code is publicly available.
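The abstract does not spell out how the entropy of floating-point numbers is measured. As a rough illustration only, the sketch below estimates the empirical Shannon entropy of the 8-bit exponent field of float32 values. Everything in it is an assumption made for illustration and not taken from NeuZip: the choice to look at the exponent field, the synthetic Gaussian "weights" with standard deviation 0.02, and the helper names `exponent_bits`, `empirical_entropy`, and `exponent_entropy`.

```python
import math
import random
import struct
from collections import Counter


def exponent_bits(x: float) -> int:
    """Return the 8-bit exponent field of x when encoded as IEEE 754 float32."""
    (bits,) = struct.unpack(">I", struct.pack(">f", x))
    return (bits >> 23) & 0xFF


def empirical_entropy(symbols) -> float:
    """Shannon entropy, in bits per symbol, of the empirical distribution."""
    counts = Counter(symbols)
    total = sum(counts.values())
    return -sum(c / total * math.log2(c / total) for c in counts.values())


def exponent_entropy(weights) -> float:
    """Entropy of the exponent field over a collection of values."""
    return empirical_entropy(exponent_bits(w) for w in weights)


# Hypothetical stand-in for trained weights: small zero-mean Gaussian values.
random.seed(0)
weights = [random.gauss(0.0, 0.02) for _ in range(100_000)]
h = exponent_entropy(weights)
print(f"exponent entropy: {h:.2f} bits per value (8 bits are stored)")
```

If the measured entropy is well below the 8 bits stored per exponent, a lossless entropy coder could in principle store that field in fewer bits without changing any value. Whether and how NeuZip exploits this is not stated in the abstract.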
