

RSQ: Learning from Important Tokens Leads to Better Quantized LLMs

March 3, 2025
作者: Yi-Lin Sung, Prateek Yadav, Jialu Li, Jaehong Yoon, Mohit Bansal
cs.AI

Abstract

Layer-wise quantization is a key technique for efficiently compressing large models without expensive retraining. Previous methods typically quantize the weights of each layer by "uniformly" optimizing the layer reconstruction loss across all output tokens. However, in this paper, we demonstrate that better-quantized models can be obtained by prioritizing learning from important tokens (e.g., tokens with large attention scores). Building on this finding, we propose RSQ (Rotate, Scale, then Quantize), which (1) applies rotations (orthogonal transformations) to the model to mitigate outliers (tokens with exceptionally large magnitudes), (2) scales each token's features according to its importance, and (3) quantizes the model using the GPTQ framework with second-order statistics computed from the scaled tokens. To compute token importance, we explore both heuristic and dynamic strategies. Based on a thorough analysis of all approaches, we adopt attention concentration, which uses each token's attention score as its importance, as the best approach. We demonstrate that RSQ consistently outperforms baseline methods across multiple downstream tasks and three model families: LLaMA3, Mistral, and Qwen2.5. Additionally, models quantized with RSQ achieve superior performance on long-context tasks, further highlighting its effectiveness. Lastly, RSQ demonstrates generalizability across various setups, including different model sizes, calibration datasets, bit precisions, and quantization methods.
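The three steps above can be illustrated with a minimal toy sketch. Everything here is an assumption for illustration: the shapes, the random orthogonal rotation, the normalization of attention scores, and the round-to-nearest step (which stands in for the full GPTQ error-compensated update) are not the authors' implementation.

```python
# Hypothetical sketch of the RSQ pipeline (rotate, scale, then quantize) on toy data.
# Names, shapes, and the final rounding step are illustrative assumptions only.
import numpy as np

rng = np.random.default_rng(0)
d, n = 8, 16                        # hidden dim, number of calibration tokens
W = rng.normal(size=(d, d))         # layer weight matrix to quantize
X = rng.normal(size=(n, d))         # calibration token activations
attn = rng.random(size=n)           # stand-in per-token attention scores

# (1) Rotate: apply an orthogonal transform to the weights and activations
# to spread out outlier values (here a random orthogonal matrix via QR).
Q, _ = np.linalg.qr(rng.normal(size=(d, d)))
W_rot = W @ Q
X_rot = X @ Q

# (2) Scale: weight each token's features by its normalized importance
# (attention concentration: attention score used as importance).
s = attn / attn.sum() * n           # importance, mean-normalized to 1
X_scaled = X_rot * np.sqrt(s)[:, None]

# (3) Quantize: second-order statistics (GPTQ's Hessian proxy) come from the
# scaled tokens; GPTQ would use H to compensate rounding error column by
# column, while this sketch just does symmetric 4-bit round-to-nearest.
H = X_scaled.T @ X_scaled / n
scale = np.abs(W_rot).max() / 7
W_q = np.clip(np.round(W_rot / scale), -8, 7) * scale
```

Because the importance scaling only changes `H` (the statistics that drive GPTQ's updates), reconstruction error is reduced preferentially on high-attention tokens rather than uniformly across all of them.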

