

DuoDecoding: Hardware-aware Heterogeneous Speculative Decoding with Dynamic Multi-Sequence Drafting

March 2, 2025
Authors: Kai Lv, Honglin Guo, Qipeng Guo, Xipeng Qiu
cs.AI

Abstract

Large language models (LLMs) exhibit exceptional performance across a wide range of tasks; however, their token-by-token autoregressive generation process significantly hinders inference speed. Speculative decoding presents a promising draft-then-verify framework that reduces generation latency while maintaining output distribution fidelity. Nevertheless, the draft model introduces additional computational overhead, becoming a performance bottleneck and increasing the time to first token (TTFT). Previous approaches to mitigating draft model overhead have primarily relied on heuristics and generally fail to match the quality of draft language models. To address these challenges, we propose DuoDecoding, a novel approach that strategically deploys the draft and target models on the CPU and GPU respectively, enabling parallel decoding while preserving draft quality. Our method incorporates a hardware-aware optimal draft budget to minimize idle time and employs dynamic multi-sequence drafting to enhance draft quality. Extensive experiments across seven tasks show that DuoDecoding achieves up to a 2.61x speedup in generation latency while reducing TTFT to 83% of that in conventional speculative decoding. The code is available at https://github.com/KaiLv69/DuoDecoding.
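
To make the draft-then-verify loop concrete, below is a minimal, self-contained sketch in Python. The toy stand-in models, the greedy exact-match acceptance rule, and the helper names draft_tokens, verify_tokens, and the fixed budget k are illustrative assumptions, not the paper's implementation; DuoDecoding's distribution-preserving sampling, its CPU/GPU parallelism, its hardware-aware budget, and its dynamic multi-sequence drafting are only noted in the comments.

```python
# Minimal sketch of draft-then-verify speculative decoding, under assumptions:
# greedy (exact-match) acceptance and callable toy "models". DuoDecoding would
# run draft_tokens on the CPU concurrently with verify_tokens on the GPU,
# choose k from a hardware-aware draft budget, and draft multiple sequences.

def draft_tokens(draft_model, seq, k):
    """Propose k tokens autoregressively with the cheap draft model (CPU-side)."""
    out = []
    for _ in range(k):
        out.append(draft_model(seq + out))
    return out

def verify_tokens(target_model, seq, draft):
    """Keep the longest draft prefix the target model agrees with (GPU-side);
    on the first mismatch, substitute the target's own token and stop."""
    accepted = []
    for tok in draft:
        expected = target_model(seq + accepted)
        accepted.append(expected)
        if tok != expected:
            break
    else:
        # Every draft token was accepted: the target yields one bonus token.
        accepted.append(target_model(seq + accepted))
    return accepted

def speculative_decode(draft_model, target_model, prompt, max_new=16, k=4):
    seq = list(prompt)
    while len(seq) - len(prompt) < max_new:
        seq += verify_tokens(target_model, seq, draft_tokens(draft_model, seq, k))
    return seq

if __name__ == "__main__":
    # Toy stand-ins: the "target" counts up; the "draft" disagrees every 4th step.
    target = lambda s: s[-1] + 1
    drafter = lambda s: s[-1] + 1 if len(s) % 4 else s[-1] + 2
    print(speculative_decode(drafter, target, [0], max_new=8, k=4))
```

As the abstract describes, the key difference in DuoDecoding is that the two stages above run in parallel on heterogeneous hardware, so drafting is removed from the critical path instead of adding latency before each verification step.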
