Quantization Hurts Reasoning? An Empirical Study on Quantized Reasoning Models

Abstract

Reasoning language models have recently demonstrated remarkable performance on complex tasks, but their extended chain-of-thought reasoning increases inference overhead. While quantization has been widely adopted to reduce the inference cost of large language models, its impact on reasoning models remains understudied. In this work, we conduct the first systematic study of quantized reasoning models, evaluating the open-source DeepSeek-R1-Distilled Qwen and LLaMA families, ranging from 1.5B to 70B parameters, and QwQ-32B. Our investigation covers weight, KV cache, and activation quantization using state-of-the-art algorithms at varying bit-widths, with extensive evaluation across mathematical (AIME, MATH-500), scientific (GPQA), and programming (LiveCodeBench) reasoning benchmarks. Our findings reveal that while lossless quantization can be achieved with W8A8 or W4A16, lower bit-widths introduce significant accuracy risks. We further identify model size, model origin, and task difficulty as critical determinants of performance. Contrary to expectations, quantized models do not exhibit increased output lengths. In addition, strategically scaling model size or the number of reasoning steps can effectively enhance performance. All quantized models and code will be open-sourced at https://github.com/ruikangliu/Quantized-Reasoning-Models.
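For readers unfamiliar with the notation, WxAy denotes x-bit weights and y-bit activations, so W4A16 stores weights in 4 bits while keeping activations in 16 bits. The sketch below illustrates the basic idea with plain symmetric round-to-nearest quantization of a single weight group. It is an illustrative assumption, not one of the algorithms evaluated in the study (the abstract does not name them), and the function names and sample weights are hypothetical.

```python
def quantize_rtn(weights, bits=4):
    """Symmetric round-to-nearest quantization of one weight group.

    Hypothetical helper for illustration; not the study's method.
    Returns integer codes and the floating-point scale shared by the group.
    """
    qmax = 2 ** (bits - 1) - 1                      # 7 for 4-bit, 127 for 8-bit
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / qmax if max_abs > 0 else 1.0  # avoid division by zero
    codes = [max(-qmax - 1, min(qmax, round(w / scale))) for w in weights]
    return codes, scale


def dequantize(codes, scale):
    """Map integer codes back to approximate floating-point weights."""
    return [c * scale for c in codes]


def max_abs_error(weights, bits):
    """Largest absolute reconstruction error at the given bit-width."""
    codes, scale = quantize_rtn(weights, bits)
    restored = dequantize(codes, scale)
    return max(abs(w - r) for w, r in zip(weights, restored))


# Made-up sample weights, used only to show how the error grows
# as the bit-width shrinks.
weights = [0.42, -1.30, 0.07, 0.88, -0.55, 1.10, -0.02, 0.31]
for bits in (8, 4, 3):
    print(f"{bits}-bit max abs error: {max_abs_error(weights, bits):.4f}")
```

With round-to-nearest, the per-weight error is bounded by half the scale, and the scale grows as the number of representable levels shrinks. This gives some intuition for why lower bit-widths carry more accuracy risk, although the effect on end-task reasoning accuracy is what the study measures empirically.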
