

Draft Model Knows When to Stop: A Self-Verification Length Policy for Speculative Decoding

November 27, 2024
作者: Ziyin Zhang, Jiahao Xu, Tian Liang, Xingyu Chen, Zhiwei He, Rui Wang, Zhaopeng Tu
cs.AI

Abstract

Speculative Decoding (SD) has become an important technique for accelerating the inference of large language models. Conventional SD methods employ a fixed draft length, which ignores variation in token generation difficulty across tasks. In this paper, we address this issue and introduce SVIP, a difficulty-aware dynamic draft length policy for speculative decoding systems. Based on a theoretical lower bound on the draft token acceptance rate and its inference-time approximation, SVIP adaptively determines the length of each draft sequence from the entropy of each draft token distribution. Experimental results on mainstream SD benchmarks and frameworks demonstrate the superior performance of SVIP, achieving up to a 20% walltime speedup over baseline SD methods on SpecBench, and a 60% speedup on MT-Bench for long-form generation of up to 8K tokens. Moreover, SVIP is entirely training-free and compatible with any existing SD method that generates draft tokens autoregressively. Experimental results also show that SVIP yields consistent walltime improvements on top of GliDe & CaPE and EAGLE-2.
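The core idea of an entropy-based stopping rule can be sketched as follows. This is a minimal illustration only, not the authors' implementation: the `threshold` hyperparameter and the `should_stop_drafting` helper are hypothetical, and the actual SVIP criterion is derived from a lower bound on the acceptance rate rather than a fixed cutoff.

```python
import math

def entropy(probs):
    """Shannon entropy (in nats) of a draft model's next-token distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)

def should_stop_drafting(probs, threshold=1.0):
    """Hypothetical stopping rule: end the current draft sequence when the
    draft distribution is too uncertain (high entropy), since confident
    (low-entropy) draft tokens are more likely to be accepted by the
    target model. `threshold` is an illustrative value, not the paper's."""
    return entropy(probs) > threshold

# Sketch of a drafting loop using the rule: a confident distribution
# continues drafting, while a near-uniform one triggers verification.
confident = [0.9, 0.05, 0.03, 0.02]
uncertain = [0.25, 0.25, 0.25, 0.25]
print(should_stop_drafting(confident))  # low entropy: keep drafting
print(should_stop_drafting(uncertain))  # high entropy: stop and verify
```

Because the rule inspects only the draft model's output distribution at each step, it requires no training and can wrap any autoregressive drafter.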

