Fast Best-of-N Decoding via Speculative Rejection
October 26, 2024
Authors: Hanshi Sun, Momin Haider, Ruiqi Zhang, Huitao Yang, Jiahao Qiu, Ming Yin, Mengdi Wang, Peter Bartlett, Andrea Zanette
cs.AI
Abstract
The safe and effective deployment of Large Language Models (LLMs) involves a
critical step called alignment, which ensures that the model's responses are in
accordance with human preferences. Prevalent alignment techniques, such as DPO,
PPO and their variants, align LLMs by changing the pre-trained model weights
during a phase called post-training. While predominant, these post-training
methods add substantial complexity before LLMs can be deployed. Inference-time
alignment methods avoid the complex post-training step and instead bias the
generation towards responses that are aligned with human preferences. The
best-known inference-time alignment method, called Best-of-N, is as effective
as the state-of-the-art post-training procedures. Unfortunately, Best-of-N
requires vastly more resources at inference time than standard decoding
strategies, which makes it computationally not viable. In this work, we
introduce Speculative Rejection, a computationally-viable inference-time
alignment algorithm. It generates high-scoring responses according to a given
reward model, like Best-of-N does, while being between 16 and 32 times more
computationally efficient.
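The abstract only sketches the two procedures at a high level. As a rough illustration, the following is a minimal Python sketch of standard Best-of-N decoding and of an early-rejection variant in the spirit of Speculative Rejection. The `generate`, `generate_partial`, and `score` callables, as well as the chunk size and keep fraction, are hypothetical placeholders assumed for this sketch; they are not the paper's actual interfaces or schedule.

```python
# Hypothetical sketch (not the paper's implementation):
# - best_of_n: generate N full responses, return the one the reward model scores highest.
# - speculative_rejection_style: extend N candidates chunk by chunk, periodically
#   scoring the partial generations and rejecting the low scorers, so most compute
#   is spent on candidates that are likely to score well at the end.
from typing import Callable, List


def best_of_n(prompt: str,
              generate: Callable[[str], str],
              score: Callable[[str, str], float],
              n: int = 32) -> str:
    """Standard Best-of-N: sample n complete responses, keep the highest-scoring one."""
    candidates = [generate(prompt) for _ in range(n)]
    return max(candidates, key=lambda r: score(prompt, r))


def speculative_rejection_style(prompt: str,
                                generate_partial: Callable[[str, str, int], str],
                                score: Callable[[str, str], float],
                                n: int = 32,
                                chunk_tokens: int = 64,
                                keep_fraction: float = 0.5,
                                max_chunks: int = 8) -> str:
    """Early-rejection variant: start n candidates but periodically prune them."""
    partials: List[str] = ["" for _ in range(n)]
    for _ in range(max_chunks):
        # Extend every surviving candidate by a chunk of tokens.
        partials = [generate_partial(prompt, p, chunk_tokens) for p in partials]
        if len(partials) > 1:
            # Rank partial responses with the reward model and reject the low scorers.
            partials.sort(key=lambda p: score(prompt, p), reverse=True)
            partials = partials[:max(1, int(len(partials) * keep_fraction))]
    # Return the best surviving (now complete or longest) response.
    return max(partials, key=lambda p: score(prompt, p))
```

The intended contrast is that Best-of-N pays for n full generations, while the rejection-style loop only pays full-length cost for the candidates that survive pruning; the specific pruning schedule above is illustrative, not the algorithm reported in the paper.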