Draft Model Knows When to Stop: A Self-Verification Length Policy for Speculative Decoding
November 27, 2024
Authors: Ziyin Zhang, Jiahao Xu, Tian Liang, Xingyu Chen, Zhiwei He, Rui Wang, Zhaopeng Tu
cs.AI
Abstract
Speculative Decoding (SD) has become an important technique in accelerating
the inference speed of large language models. Conventional SD methods employ a
fixed draft length, which ignores the token generation difficulty across tasks.
Consequently, in this paper, we address such an issue and introduce SVIP - a
difficulty-aware dynamic draft length policy for speculative decoding systems.
Based on a theoretical lower bound of draft token acceptance rate and its
inference-time approximation, SVIP adaptively determines the lengths of draft
sequences based on the entropy of each draft token distribution. Experimental
results on mainstream SD benchmarks and frameworks demonstrate the superior
performance of SVIP, achieving up to 20% walltime speedup on SpecBench over
baseline SD methods and 60% speedup on MT-Bench for long-form generation of up
to 8K tokens. Moreover, SVIP is totally training-free and compatible with any
existing SD methods that generate draft tokens autoregressively. Experimental
results also show that SVIP yields consistent walltime improvement on top of
GliDe & CaPE and EAGLE-2.
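The core idea in the abstract, that the drafter should stop once the entropy of its next-token distribution signals a likely rejection, can be sketched as below. This is a minimal illustration, not the paper's actual criterion: the names `entropy`, `draft_length`, and the parameters `threshold` and `max_len` are hypothetical stand-ins for SVIP's theoretically derived acceptance-rate bound.

```python
import math

def entropy(probs):
    """Shannon entropy (in nats) of a token probability distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)

def draft_length(draft_step_distributions, threshold=1.0, max_len=8):
    """Illustrative entropy-based stopping rule: keep drafting while each
    draft token's distribution is low-entropy (the drafter is confident),
    and stop once entropy exceeds `threshold` or `max_len` is reached.
    `threshold` and `max_len` are assumed hyperparameters, not SVIP's."""
    length = 0
    for probs in draft_step_distributions:
        if length >= max_len or entropy(probs) > threshold:
            break
        length += 1
    return length

# A confident first step is drafted; a uniform (high-entropy) second
# step triggers an early stop, so only one draft token is proposed.
dists = [[0.97, 0.01, 0.01, 0.01], [0.25, 0.25, 0.25, 0.25]]
print(draft_length(dists))
```

In a real SD system this check would run after each autoregressive draft step, ending the drafting round early on hard tokens and letting easy spans run longer, which is what makes the policy difficulty-aware.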