Lossless Acceleration of Large Language Models with Hierarchical Drafting based on Temporal Locality in Speculative Decoding

Abstract

Accelerating inference in Large Language Models (LLMs) is critical for real-time interaction, as these models are now widely incorporated into real-world services. Speculative decoding, a fully algorithmic solution, has gained attention for improving inference speed by drafting and verifying tokens, thereby generating multiple tokens in a single forward pass. However, current drafting strategies usually require significant fine-tuning or perform inconsistently across tasks. To address these challenges, we propose Hierarchy Drafting (HD), a novel lossless drafting approach that organizes various token sources into multiple databases in a hierarchical framework based on temporal locality. In the drafting step, HD accesses the databases sequentially, from the highest to the lowest locality, to obtain draft tokens, ensuring consistent acceleration across diverse tasks and minimizing drafting latency. Our experiments on Spec-Bench using LLMs with 7B and 13B parameters demonstrate that HD outperforms existing database drafting methods, achieving robust inference speedups across model sizes, tasks, and temperatures.
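The drafting step described above can be illustrated with a minimal sketch. It is not the authors' implementation: the class `NGramDatabase`, the functions `hierarchical_draft` and `verify_greedy`, the two-token lookup prefix, and the toy target model are all hypothetical names and choices made for illustration. The sketch assumes each database maps a short token prefix to a stored continuation, that databases are queried in order of decreasing locality, and that verification is greedy, so the accepted tokens match what plain greedy decoding would produce.

```python
from typing import Callable, Dict, List, Optional, Sequence, Tuple

Token = int


class NGramDatabase:
    """Hypothetical lookup table: fixed-length token prefix -> stored continuation."""

    def __init__(self, name: str, prefix_len: int = 2) -> None:
        self.name = name
        self.prefix_len = prefix_len
        self.table: Dict[Tuple[Token, ...], List[Token]] = {}

    def add(self, tokens: Sequence[Token], draft_len: int = 3) -> None:
        # Index every window; a later occurrence of a prefix overwrites an earlier one.
        for i in range(len(tokens) - self.prefix_len):
            start = i + self.prefix_len
            prefix = tuple(tokens[i:start])
            self.table[prefix] = list(tokens[start:start + draft_len])

    def lookup(self, context: Sequence[Token]) -> List[Token]:
        if len(context) < self.prefix_len:
            return []
        return self.table.get(tuple(context[-self.prefix_len:]), [])


def hierarchical_draft(
    databases: Sequence[NGramDatabase], context: Sequence[Token]
) -> Tuple[Optional[str], List[Token]]:
    """Query databases from highest to lowest locality; stop at the first hit."""
    for db in databases:
        draft = db.lookup(context)
        if draft:
            return db.name, draft
    return None, []


def verify_greedy(
    context: Sequence[Token],
    draft: Sequence[Token],
    next_token: Callable[[Sequence[Token]], Token],
) -> List[Token]:
    """Keep draft tokens while they match the target model's greedy choice.

    A real system scores all draft positions in one forward pass; this toy
    version calls `next_token` per position to stay readable.
    """
    accepted: List[Token] = []
    for tok in draft:
        expected = next_token(list(context) + accepted)
        if tok != expected:
            accepted.append(expected)  # replace the first mismatch and stop
            return accepted
        accepted.append(tok)
    # Every draft token matched, so the target model's next token comes for free.
    accepted.append(next_token(list(context) + accepted))
    return accepted


if __name__ == "__main__":
    # Highest locality: tokens seen in the current context (assumed source).
    context_db = NGramDatabase("context")
    context_db.add([1, 2, 3, 4, 5])
    # Lowest locality: tokens from a general corpus (assumed source).
    corpus_db = NGramDatabase("corpus")
    corpus_db.add([9, 9, 7, 8])

    def toy_model(ctx: Sequence[Token]) -> Token:
        # Toy target model: always predicts the last token plus one.
        return ctx[-1] + 1

    source, draft = hierarchical_draft([context_db, corpus_db], [0, 1, 2])
    print(source, draft, verify_greedy([0, 1, 2], draft, toy_model))
```

Because the loop returns at the first database that yields a draft, lower-locality databases are consulted only when higher-locality ones miss, which is one plausible reading of how sequential access keeps drafting latency low.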
