Multi-Draft Speculative Sampling: Canonical Architectures and Theoretical Limits

Abstract

We consider multi-draft speculative sampling, where the proposal sequences are sampled independently from different draft models. At each step, a token-level draft selection scheme takes a list of valid tokens as input and produces an output token whose distribution matches that of the target model. Previous works have demonstrated that the optimal scheme (which maximizes the probability of accepting one of the input tokens) can be cast as a solution to a linear program. In this work we show that the optimal scheme can be decomposed into a two-step solution: in the first step, an importance sampling (IS) type scheme is used to select one intermediate token; in the second step, (single-draft) speculative sampling is applied to generate the output token. For the case of two identical draft models we further (1) establish a necessary and sufficient condition on the distributions of the target and draft models for the acceptance probability to equal one and (2) provide an explicit expression for the optimal acceptance probability. Our theoretical analysis also motivates a new class of token-level selection schemes based on weighted importance sampling. Our experimental results demonstrate consistent improvements in the achievable block efficiency and token rates over baseline schemes in a number of scenarios.
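
As an illustration of the second step named in the abstract, the following is a minimal sketch of standard single-draft speculative sampling for one token: accept the draft token x with probability min(1, p(x)/q(x)), and otherwise resample from the normalized residual max(p - q, 0). It does not implement the importance-sampling first step or the linear program discussed in the abstract. The function names `speculative_accept` and `output_distribution`, and the representation of distributions as Python dicts, are assumptions made for this example and are not taken from the paper.

```python
import random


def speculative_accept(p, q, x, rng=random):
    """One step of single-draft speculative sampling (illustrative sketch).

    p, q: dicts mapping token -> probability for the target and draft models.
    x:    a draft token assumed to have been sampled from q (so q[x] > 0).
    Returns (output_token, accepted_flag).
    """
    # Accept the draft token with probability min(1, p(x) / q(x)).
    if rng.random() < min(1.0, p.get(x, 0.0) / q[x]):
        return x, True
    # On rejection, resample from the normalized residual max(p - q, 0).
    tokens = list(p)
    residual = [max(p[t] - q.get(t, 0.0), 0.0) for t in tokens]
    return rng.choices(tokens, weights=residual)[0], False


def output_distribution(p, q):
    """Exact output distribution and acceptance probability of the step above."""
    tokens = set(p) | set(q)
    # Probability mass that is accepted directly at each token: min(p, q).
    accepted = {t: min(p.get(t, 0.0), q.get(t, 0.0)) for t in tokens}
    acc_prob = sum(accepted.values())
    residual = {t: max(p.get(t, 0.0) - q.get(t, 0.0), 0.0) for t in tokens}
    total = sum(residual.values())
    out = {}
    for t in tokens:
        # Rejected mass (1 - acc_prob) is redistributed along the residual.
        extra = (1.0 - acc_prob) * residual[t] / total if total > 0 else 0.0
        out[t] = accepted[t] + extra
    return out, acc_prob
```

Under these assumptions, `output_distribution(p, q)` returns the target distribution p itself, together with an acceptance probability equal to the sum over tokens of min(p(x), q(x)). This is the distribution-matching property that the abstract requires of any valid draft selection scheme.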
