Scaling Laws for Floating Point Quantization Training

Abstract

Low-precision training is considered an effective strategy for reducing both training and downstream inference costs. Previous scaling laws for precision focus mainly on integer quantization; they pay little attention to the constituents of floating-point quantization and therefore do not fit LLM losses well in this scenario. Meanwhile, although floating-point quantization training is more commonly implemented in production, research on it has been relatively superficial. In this paper, we thoroughly explore how the floating-point quantization targets, the exponent bits, the mantissa bits, and the calculation granularity of the scaling factor affect LLM performance under floating-point quantization training. In addition to presenting an accurate unified scaling law for floating-point quantization, we provide suggestions for the community: (1) Exponent bits contribute slightly more to model performance than mantissa bits. We provide the optimal exponent-mantissa bit ratio for different bit widths, which hardware manufacturers can use as a future reference. (2) We discover that a critical data size emerges in low-precision LLM training: training data beyond the critical data size degrades LLM performance rather than improving it. (3) The optimal floating-point quantization precision is directly proportional to the computational power, but across a wide range of computational power we estimate that the best cost-performance precision lies between 4 and 8 bits.
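To make the terms in the abstract concrete, the following is a minimal sketch of simulated floating-point quantization with a configurable number of exponent and mantissa bits, and with the scaling factor computed at two granularities (per tensor and per row). It is an illustration only, not the paper's implementation. The function names (fp_max, fp_quantize, quantize_with_scale), the IEEE-like exponent bias, the choice to reserve no codes for infinity or NaN, and the absolute-maximum scaling rule are all assumptions made for this example.

```python
import numpy as np


def fp_max(exp_bits, man_bits):
    """Largest finite magnitude of the toy format.

    Assumption: IEEE-like bias, and every exponent code encodes a finite
    value (no codes reserved for inf/NaN), unlike some real FP8 formats.
    """
    bias = 2 ** (exp_bits - 1) - 1
    max_exp = (2 ** exp_bits - 1) - bias
    return (2.0 - 2.0 ** -man_bits) * 2.0 ** max_exp


def fp_quantize(x, exp_bits, man_bits):
    """Round x to the nearest value representable in the toy format."""
    x = np.asarray(x, dtype=np.float64)
    bias = 2 ** (exp_bits - 1) - 1
    min_exp = 1 - bias  # smallest normal exponent; below it values are subnormal
    limit = fp_max(exp_bits, man_bits)
    x = np.clip(x, -limit, limit)  # saturate instead of overflowing
    # frexp gives |x| = m * 2**e with 0.5 <= m < 1, so floor(log2|x|) = e - 1.
    _, e = np.frexp(x)
    exp = np.maximum(e - 1, min_exp)
    # Spacing between neighbouring representable values at this exponent.
    step = 2.0 ** (exp - man_bits)
    return np.round(x / step) * step  # np.round rounds ties to even


def quantize_with_scale(x, exp_bits, man_bits, axis=None):
    """Quantize after mapping the absolute maximum onto the format's range.

    axis=None uses one scaling factor for the whole tensor; axis=1 uses one
    scaling factor per row (a finer granularity).
    """
    x = np.asarray(x, dtype=np.float64)
    amax = np.max(np.abs(x), axis=axis, keepdims=True)
    scale = np.where(amax > 0, amax / fp_max(exp_bits, man_bits), 1.0)
    return fp_quantize(x / scale, exp_bits, man_bits) * scale


if __name__ == "__main__":
    x = np.array([[0.01, 0.02], [10.0, 20.0]])
    for name, axis in [("per-tensor", None), ("per-row", 1)]:
        # 2 exponent bits and 1 mantissa bit: a very coarse 4-bit-style format.
        err = np.max(np.abs(quantize_with_scale(x, 2, 1, axis=axis) - x))
        print(name, "max abs error:", err)
```

In this toy setting, a single per-tensor scaling factor pushes the small row below the format's resolution, while per-row scaling factors keep both rows representable. This is only meant to show why the granularity of the scaling factor can matter; it says nothing about the magnitudes reported in the paper.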
