Low-Bit Quantization Favors Undertrained LLMs: Scaling Laws for Quantized LLMs with 100T Training Tokens

Abstract

We reveal that low-bit quantization favors undertrained large language models (LLMs): models with larger sizes or fewer training tokens experience less quantization-induced degradation (QiD) when low-bit quantization is applied, whereas smaller models with extensive training tokens suffer significant QiD. To gain deeper insight into this trend, we study over 1500 quantized LLM checkpoints of various sizes and at different training levels (undertrained or fully trained) in a controlled setting, and derive scaling laws that relate QiD to factors such as the number of training tokens, model size, and bit width. With the derived scaling laws, we propose a novel perspective: QiD can be used to measure an LLM's training level and to determine the number of training tokens required to fully train LLMs of various sizes. Moreover, we use the scaling laws to predict the quantization performance of different-sized LLMs trained with 100 trillion tokens. Our projection shows that the low-bit quantization performance of future models, which are expected to be trained with over 100 trillion tokens, may NOT be desirable. This poses a potential challenge for low-bit quantization in the future and highlights the need for awareness of a model's training level when evaluating low-bit quantization research. To facilitate future research on this problem, we release all the 1500+ quantized checkpoints used in this work at https://huggingface.co/Xu-Ouyang.
