Low-Bit Quantization Favors Undertrained LLMs: Scaling Laws for Quantized LLMs with 100T Training Tokens

November 26, 2024
Authors: Xu Ouyang, Tao Ge, Thomas Hartvigsen, Zhisong Zhang, Haitao Mi, Dong Yu
cs.AI

Abstract

We reveal that low-bit quantization favors undertrained large language models (LLMs) by observing that models with larger sizes or fewer training tokens experience less quantization-induced degradation (QiD) when applying low-bit quantization, whereas smaller models with extensive training tokens suffer significant QiD. To gain deeper insights into this trend, we study over 1500 quantized LLM checkpoints of various sizes and at different training levels (undertrained or fully trained) in a controlled setting, deriving scaling laws for understanding the relationship between QiD and factors such as the number of training tokens, model size and bit width. With the derived scaling laws, we propose a novel perspective that we can use QiD to measure an LLM's training levels and determine the number of training tokens required for fully training LLMs of various sizes. Moreover, we use the scaling laws to predict the quantization performance of different-sized LLMs trained with 100 trillion tokens. Our projection shows that the low-bit quantization performance of future models, which are expected to be trained with over 100 trillion tokens, may NOT be desirable. This poses a potential challenge for low-bit quantization in the future and highlights the need for awareness of a model's training level when evaluating low-bit quantization research. To facilitate future research on this problem, we release all the 1500+ quantized checkpoints used in this work at https://huggingface.co/Xu-Ouyang.
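
To make the scaling-law idea concrete, here is a minimal Python sketch of how such a relationship could be fitted on synthetic data. The power-law form QiD ≈ k · D^β / (N^α · P^γ) (D = training tokens, N = model parameters, P = bit width), the function name qid_scaling_law, the coefficient values, and every number in the snippet are illustrative assumptions rather than the paper's actual formula, fitted coefficients, or measurements.

```python
# Hypothetical sketch: fit a power-law QiD scaling law on synthetic data.
# QiD ≈ k * D**beta / (N**alpha * P**gamma) is an assumed form for illustration,
# NOT the paper's exact formula. All numbers below are synthetic.
import numpy as np

def qid_scaling_law(D, N, P, k, alpha, beta, gamma):
    """Assumed power-law form: QiD = k * D**beta / (N**alpha * P**gamma)."""
    return k * D**beta / (N**alpha * P**gamma)

rng = np.random.default_rng(0)

# Synthetic grid of quantized checkpoints: token counts D, model sizes N, bit widths P.
D = np.repeat([2e10, 1e11, 3e11, 1e12], 6)            # training tokens
N = np.tile(np.repeat([1.6e8, 1.0e9], 3), 4)          # model parameters
P = np.tile([2.0, 3.0, 4.0], 8)                       # quantization bit width

# Noisy "observations" generated from known ground-truth coefficients.
true_k, true_alpha, true_beta, true_gamma = 0.1, 0.45, 0.35, 2.0
qid_obs = qid_scaling_law(D, N, P, true_k, true_alpha, true_beta, true_gamma)
qid_obs *= rng.lognormal(0.0, 0.05, size=D.size)

# The law is linear in log space:
#   log QiD = log k + beta*log D - alpha*log N - gamma*log P,
# so ordinary least squares recovers the coefficients.
A = np.column_stack([np.ones_like(D), np.log(D), -np.log(N), -np.log(P)])
coef, *_ = np.linalg.lstsq(A, np.log(qid_obs), rcond=None)
k, beta, alpha, gamma = np.exp(coef[0]), coef[1], coef[2], coef[3]
print(f"fitted: k={k:.3g}, alpha={alpha:.3g}, beta={beta:.3g}, gamma={gamma:.3g}")

# Extrapolate the fitted law to a hypothetical 100T-token, 7B-parameter, 4-bit setting.
print("extrapolated QiD:", qid_scaling_law(1e14, 7e9, 4.0, k, alpha, beta, gamma))
```

Fitting in log space turns the assumed power law into an ordinary least-squares problem, which avoids the convergence issues of nonlinear fitting; the same recipe extends to whatever functional form the paper actually derives.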
