Jakiro: Boosting Speculative Decoding with Decoupled Multi-Head via MoE

Abstract

Speculative decoding (SD) accelerates large language model inference by using a smaller draft model to predict multiple tokens, which the larger target model then verifies in parallel. However, the limited capacity of the draft model often necessitates tree-based sampling, in which multiple candidates are generated at each step, to improve prediction accuracy. We identify a key limitation in this approach: the candidates at the same step are derived from the same representation, which limits diversity and reduces overall effectiveness. To address this, we propose Jakiro, which leverages Mixture of Experts (MoE) so that independent experts generate diverse predictions, effectively decoupling correlations among candidates. Furthermore, we introduce a hybrid inference strategy that combines autoregressive decoding for initial tokens with parallel decoding for subsequent stages, and we enhance the latter with a contrastive mechanism in features to improve accuracy. Our method significantly boosts prediction accuracy and achieves higher inference speedups. Extensive experiments across diverse models validate the effectiveness and robustness of our approach, establishing a new SOTA in speculative decoding. Our code is available at https://github.com/haiduo/Jakiro.
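The draft-then-verify loop that speculative decoding builds on can be illustrated with a minimal greedy sketch. This is not the Jakiro implementation: it omits tree-based sampling, MoE heads, and the hybrid strategy, and the names `speculative_step`, `generate`, `toy_target`, and `toy_draft` are hypothetical, with toy functions standing in for real models.

```python
from typing import Callable, List, Sequence

# Assumption: a "model" is any function mapping a token prefix to its greedy next token.
NextToken = Callable[[Sequence[int]], int]


def speculative_step(prefix: List[int], draft: NextToken,
                     target: NextToken, k: int) -> List[int]:
    """Draft k tokens, then keep the longest run the target agrees with."""
    # Draft phase: the cheap model proposes k tokens autoregressively.
    proposed: List[int] = []
    for _ in range(k):
        proposed.append(draft(prefix + proposed))

    # Verify phase: a real target model would score all k positions in one
    # batched forward pass; this toy version checks them in a loop.
    accepted: List[int] = []
    for tok in proposed:
        expected = target(prefix + accepted)
        if tok != expected:
            # First mismatch: emit the target's token and discard the rest.
            accepted.append(expected)
            return accepted
        accepted.append(tok)

    # Every draft token matched, so the target contributes one extra token.
    accepted.append(target(prefix + accepted))
    return accepted


def generate(prefix: List[int], draft: NextToken, target: NextToken,
             k: int, max_new: int) -> List[int]:
    """Repeat speculative steps until max_new tokens have been produced."""
    out = list(prefix)
    while len(out) - len(prefix) < max_new:
        out.extend(speculative_step(out, draft, target, k))
    return out[: len(prefix) + max_new]


def toy_target(tokens: Sequence[int]) -> int:
    # Hypothetical "large" model: counts upward modulo 10.
    return (tokens[-1] + 1) % 10


def toy_draft(tokens: Sequence[int]) -> int:
    # Hypothetical "small" model: agrees with the target except after token 4.
    return 0 if tokens[-1] == 4 else (tokens[-1] + 1) % 10


if __name__ == "__main__":
    print(generate([0], toy_draft, toy_target, k=3, max_new=8))
```

Under greedy verification the output matches what the target alone would produce; the gain comes from accepting several tokens per target call whenever the draft agrees. Draft quality therefore sets the speedup, which is the quantity the diverse MoE candidates described above aim to improve.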
