Adaptive Decoding via Latent Preference Optimization
November 14, 2024
Authors: Shehzaad Dhuliawala, Ilia Kulikov, Ping Yu, Asli Celikyilmaz, Jason Weston, Sainbayar Sukhbaatar, Jack Lanchantin
cs.AI
Abstract
During language model decoding, it is known that higher-temperature sampling gives more creative responses, while lower temperatures are more factually accurate. However, such models are commonly applied to general instruction following, which involves both creative and fact-seeking tasks, using a single fixed temperature across all examples and tokens. In this work, we introduce Adaptive Decoding, a layer added to the model that selects the sampling temperature dynamically at inference time, at either the token or example level, in order to optimize performance. To learn its parameters, we introduce Latent Preference Optimization (LPO), a general approach for training discrete latent variables such as choices of temperature. Our method outperforms all fixed decoding temperatures across a range of tasks that require different temperatures, including UltraFeedback, Creative Story Writing, and GSM8K.