Quantizing Large Language Models for Code Generation: A Differentiated Replication
March 10, 2025
Authors: Alessandro Giagnorio, Antonio Mastropaolo, Saima Afrin, Massimiliano Di Penta, Gabriele Bavota
cs.AI
Abstract
Large Language Models (LLMs) have shown an impressive capability in code generation and, specifically, in automatically implementing requirements described in natural language. An LLM's effectiveness generally increases with its size: the higher the number of trainable parameters, the better its ability to implement code. However, when it comes to deploying LLM-based code generators, larger LLMs pose significant challenges related to their memory (and, consequently, carbon) footprint. A previous work by Wei et al. proposed leveraging quantization techniques to reduce the memory footprint of LLM-based code generators without substantially degrading their effectiveness. In short, they studied LLMs featuring up to 16B parameters, quantizing their precision from 32-bit floating point down to 8-bit integers and showing the limited impact of quantization on code generation performance. Given the fast pace at which LLM capabilities and quantization techniques are evolving, in this work we present a differentiated replication of the work by Wei et al. in which we consider: (i) on the one side, more recent and larger code-related LLMs, of up to 34B parameters; (ii) the latest advancements in model quantization techniques, which allow pushing the compression to the extreme quantization level of 2 bits per model parameter; and (iii) different types of calibration datasets to guide the quantization process, including code-specific ones. Our empirical evaluation reveals that the new frontier for LLM quantization is 4-bit precision, resulting in an average memory footprint reduction of 70% compared to the original model without any significant decrease in performance. Additionally, when the quantization becomes even more extreme (3 and 2 bits), a code-specific calibration dataset helps to limit the loss of performance.
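The abstract does not name a specific quantization toolchain. As one hedged illustration, the minimal sketch below shows how 4-bit post-training quantization with a code-specific calibration set could be run through the GPTQ integration in Hugging Face transformers; the model id, calibration snippets, and hyperparameters are placeholders rather than the study's actual setup.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer, GPTQConfig

model_id = "codellama/CodeLlama-7b-hf"   # illustrative model id
tokenizer = AutoTokenizer.from_pretrained(model_id)

# A handful of code snippets standing in for a code-specific calibration dataset.
calibration_code = [
    "def fibonacci(n):\n    a, b = 0, 1\n    for _ in range(n):\n        a, b = b, a + b\n    return a",
    "class Stack:\n    def __init__(self):\n        self.items = []\n\n    def push(self, x):\n        self.items.append(x)",
]

quant_config = GPTQConfig(
    bits=4,                   # target precision; lower settings (3, 2 bits) are more lossy
    group_size=128,
    dataset=calibration_code, # calibration data guides the quantization error minimization
    tokenizer=tokenizer,
)

# Quantization runs while loading; requires the optimum and auto-gptq packages.
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=quant_config,
    device_map="auto",
)
model.save_pretrained("codellama-7b-gptq-4bit")
```

In this kind of workflow, swapping a general-text calibration set for code snippets is a one-line change, which is the lever the study examines at the more extreme 3-bit and 2-bit settings.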