Quantization Hurts Reasoning? An Empirical Study on Quantized Reasoning Models

Abstract

Recent advances in reasoning language models have yielded remarkable performance on complex tasks, but their extended chain-of-thought reasoning increases inference overhead. Quantization is widely used to reduce the inference cost of large language models, yet its impact on reasoning models remains understudied. In this work, we conduct the first systematic study of quantized reasoning models, evaluating the open-source DeepSeek-R1-Distilled Qwen and LLaMA families, ranging from 1.5B to 70B parameters, as well as QwQ-32B. Our investigation covers weight, KV cache, and activation quantization using state-of-the-art algorithms at varying bit-widths, with extensive evaluation on mathematical (AIME, MATH-500), scientific (GPQA), and programming (LiveCodeBench) reasoning benchmarks. Our findings show that W8A8 or W4A16 quantization can be lossless, whereas lower bit-widths introduce significant accuracy risks. We further identify model size, model origin, and task difficulty as critical determinants of performance. Contrary to expectations, quantized models do not exhibit increased output lengths. In addition, strategically scaling model size or the number of reasoning steps can effectively improve performance. All quantized models and code will be open-sourced at https://github.com/ruikangliu/Quantized-Reasoning-Models.
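For readers unfamiliar with the notation, WxAy denotes x-bit weights and y-bit activations, so W4A16 stores weights in 4 bits and keeps activations in 16 bits. The following is a minimal sketch of how bit-width affects rounding error. It assumes plain round-to-nearest symmetric quantization with a single per-tensor scale, which is a simple baseline and not one of the state-of-the-art algorithms evaluated in the study. The function names and the sample weights are hypothetical and chosen only for illustration.

```python
def quantize_symmetric(values, bits):
    """Round-to-nearest symmetric quantization of a list of floats.

    Returns (codes, scale), where codes are integers in
    [-(2**(bits-1)), 2**(bits-1) - 1] and value ~= code * scale.
    Assumption: one scale for the whole tensor (per-tensor scaling).
    """
    qmax = 2 ** (bits - 1) - 1
    qmin = -(2 ** (bits - 1))
    max_abs = max(abs(v) for v in values)
    # Fall back to a scale of 1.0 for an all-zero tensor to avoid dividing by zero.
    scale = max_abs / qmax if max_abs > 0 else 1.0
    codes = [min(qmax, max(qmin, round(v / scale))) for v in values]
    return codes, scale


def dequantize(codes, scale):
    """Map integer codes back to approximate floating-point values."""
    return [c * scale for c in codes]


def max_abs_error(values, bits):
    """Largest absolute difference between the original and dequantized values."""
    codes, scale = quantize_symmetric(values, bits)
    restored = dequantize(codes, scale)
    return max(abs(v - r) for v, r in zip(values, restored))


# Hypothetical weight values, used only to illustrate the effect of bit-width.
weights = [0.82, -0.41, 0.07, -1.30, 0.55, 0.003, -0.96, 1.12]
for bits in (8, 4, 3):
    print(f"W{bits}: max abs error = {max_abs_error(weights, bits):.4f}")
```

On this toy input the rounding error grows as the bit-width shrinks, which is the basic mechanism behind the accuracy risk at lower bit-widths. It says nothing about downstream reasoning accuracy, which the study measures on the benchmarks listed above.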
