Fast Best-of-N Decoding via Speculative Rejection
October 26, 2024
Authors: Hanshi Sun, Momin Haider, Ruiqi Zhang, Huitao Yang, Jiahao Qiu, Ming Yin, Mengdi Wang, Peter Bartlett, Andrea Zanette
cs.AI
Abstract
The safe and effective deployment of Large Language Models (LLMs) involves a
critical step called alignment, which ensures that the model's responses are in
accordance with human preferences. Prevalent alignment techniques, such as DPO,
PPO and their variants, align LLMs by changing the pre-trained model weights
during a phase called post-training. While predominant, these post-training
methods add substantial complexity before LLMs can be deployed. Inference-time
alignment methods avoid the complex post-training step and instead bias the
generation towards responses that are aligned with human preferences. The
best-known inference-time alignment method, called Best-of-N, is as effective
as the state-of-the-art post-training procedures. Unfortunately, Best-of-N
requires vastly more resources at inference time than standard decoding
strategies, which makes it computationally not viable. In this work, we
introduce Speculative Rejection, a computationally-viable inference-time
alignment algorithm. It generates high-scoring responses according to a given
reward model, like Best-of-N does, while being between 16 and 32 times more
computationally efficient.
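The abstract only sketches the two procedures at a high level. As a rough illustration, the following is a minimal Python sketch of standard Best-of-N decoding and of an early-rejection variant in the spirit of Speculative Rejection. The `generate`, `generate_partial`, and `score` callables, as well as the chunk size and keep fraction, are hypothetical placeholders assumed for this sketch; they are not the paper's actual interfaces or schedule.

```python
# Hypothetical sketch (not the paper's implementation):
# - best_of_n: generate N full responses, return the one the reward model scores highest.
# - speculative_rejection_style: extend N candidates chunk by chunk, periodically
#   scoring the partial generations and rejecting the low scorers, so most compute
#   is spent on candidates that are likely to score well at the end.
from typing import Callable, List


def best_of_n(prompt: str,
              generate: Callable[[str], str],
              score: Callable[[str, str], float],
              n: int = 32) -> str:
    """Standard Best-of-N: sample n complete responses, keep the highest-scoring one."""
    candidates = [generate(prompt) for _ in range(n)]
    return max(candidates, key=lambda r: score(prompt, r))


def speculative_rejection_style(prompt: str,
                                generate_partial: Callable[[str, str, int], str],
                                score: Callable[[str, str], float],
                                n: int = 32,
                                chunk_tokens: int = 64,
                                keep_fraction: float = 0.5,
                                max_chunks: int = 8) -> str:
    """Early-rejection variant: start n candidates but periodically prune them."""
    partials: List[str] = ["" for _ in range(n)]
    for _ in range(max_chunks):
        # Extend every surviving candidate by a chunk of tokens.
        partials = [generate_partial(prompt, p, chunk_tokens) for p in partials]
        if len(partials) > 1:
            # Rank partial responses with the reward model and reject the low scorers.
            partials.sort(key=lambda p: score(prompt, p), reverse=True)
            partials = partials[:max(1, int(len(partials) * keep_fraction))]
    # Return the best surviving (now complete or longest) response.
    return max(partials, key=lambda p: score(prompt, p))
```

The intended contrast is that Best-of-N pays for n full generations, while the rejection-style loop only pays full-length cost for the candidates that survive pruning; the specific pruning schedule above is illustrative, not the algorithm reported in the paper.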