ChatPaper.ai

Lossless Acceleration of Large Language Models with Hierarchical Drafting based on Temporal Locality in Speculative Decoding

February 8, 2025
作者: Sukmin Cho, Sangjin Choi, Taeho Hwang, Jeongyeon Seo, Soyeong Jeong, Huije Lee, Hoyun Song, Jong C. Park, Youngjin Kwon
cs.AI

Abstract
Accelerating inference in Large Language Models (LLMs) is critical for real-time interactions, as they have been widely incorporated into real-world services. Speculative decoding, a fully algorithmic solution, has gained attention for improving inference speed by drafting and verifying tokens, thereby generating multiple tokens in a single forward pass. However, current drafting strategies usually require significant fine-tuning or have inconsistent performance across tasks. To address these challenges, we propose Hierarchy Drafting (HD), a novel lossless drafting approach that organizes various token sources into multiple databases in a hierarchical framework based on temporal locality. In the drafting step, HD sequentially accesses multiple databases to obtain draft tokens from the highest to the lowest locality, ensuring consistent acceleration across diverse tasks and minimizing drafting latency. Our experiments on Spec-Bench using LLMs with 7B and 13B parameters demonstrate that HD outperforms existing database drafting methods, achieving robust inference speedups across model sizes, tasks, and temperatures.
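The core drafting step described above, probing token databases in order of temporal locality and taking the first hit, can be sketched as follows. This is a minimal illustration, not the paper's implementation: the class name, the n-gram keying, and the three-level split into recent-context, interaction-history, and corpus databases are all assumptions for demonstration.

```python
# Minimal sketch of hierarchical draft-token lookup (illustrative only; the
# database contents and the HierarchicalDrafter name are assumptions).

class HierarchicalDrafter:
    def __init__(self, databases):
        # databases: list of dicts mapping a context n-gram (tuple of token
        # ids) to candidate draft tokens, ordered from highest temporal
        # locality (e.g., the current generation) to lowest (e.g., corpus
        # statistics).
        self.databases = databases

    def draft(self, context, ngram=2, max_draft=4):
        key = tuple(context[-ngram:])
        # Probe databases in locality order; the first hit supplies the
        # draft, keeping lookup latency low when high-locality sources
        # already cover the context.
        for db in self.databases:
            if key in db:
                return db[key][:max_draft]
        return []  # no draft found: fall back to standard decoding

# Toy usage with integer token ids standing in for real vocabulary entries.
recent = {(5, 7): [9, 11]}            # highest locality: current generation
history = {(5, 7): [8], (2, 3): [4]}  # mid locality: past interactions
corpus = {(1, 2): [3]}                # lowest locality: global statistics

drafter = HierarchicalDrafter([recent, history, corpus])
print(drafter.draft([1, 5, 7]))  # -> [9, 11], served by the recent database
print(drafter.draft([1, 2, 3]))  # -> [4], falls through to history
```

The drafted tokens would then be verified by the target LLM in a single forward pass, with mismatching positions discarded, which is what makes the approach lossless.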


PDF (183) · February 11, 2025