Fast Best-of-N Decoding via Speculative Rejection

Abstract

The safe and effective deployment of Large Language Models (LLMs) involves a critical step called alignment, which ensures that the model's responses accord with human preferences. Prevalent alignment techniques, such as DPO, PPO and their variants, align LLMs by changing the pre-trained model weights during a phase called post-training. While predominant, these post-training methods add substantial complexity before LLMs can be deployed. Inference-time alignment methods avoid the complex post-training step and instead bias the generation towards responses that are aligned with human preferences. The best-known inference-time alignment method, called Best-of-N, is as effective as state-of-the-art post-training procedures. Unfortunately, Best-of-N requires vastly more resources at inference time than standard decoding strategies, which makes it computationally infeasible in practice. In this work, we introduce Speculative Rejection, a computationally viable inference-time alignment algorithm. Like Best-of-N, it generates high-scoring responses according to a given reward model, while being 16 to 32 times more computationally efficient.
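For readers unfamiliar with the baseline, the following is a minimal sketch of plain Best-of-N as it is commonly described: sample N complete responses, score each with the reward model, and return the highest-scoring one. It is not the Speculative Rejection algorithm, whose details the abstract does not give. The names `best_of_n`, `toy_generate`, and `toy_reward` are hypothetical stand-ins introduced only for illustration; a real setup would call an LLM sampler and a trained reward model instead.

```python
import itertools
from typing import Callable


def best_of_n(
    prompt: str,
    generate: Callable[[str], str],
    reward: Callable[[str, str], float],
    n: int,
) -> str:
    """Sample n full responses and return the one with the highest reward."""
    if n < 1:
        raise ValueError("n must be at least 1")
    # Each candidate is a complete generation, so cost grows linearly with n.
    candidates = [generate(prompt) for _ in range(n)]
    # The reward model scores every (prompt, response) pair.
    return max(candidates, key=lambda response: reward(prompt, response))


# Toy stand-ins (assumptions, not real models): the "generator" cycles
# through canned strings and the "reward" is simply the response length.
_canned = itertools.cycle(["ok", "a longer answer", "mid reply"])


def toy_generate(prompt: str) -> str:
    return next(_canned)


def toy_reward(prompt: str, response: str) -> float:
    return float(len(response))


best = best_of_n("Explain alignment.", toy_generate, toy_reward, n=3)
print(best)
```

The sketch makes the cost visible: every call performs N full generations plus N reward evaluations, which is the inference-time overhead the abstract refers to.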
