Adaptive Decoding via Latent Preference Optimization

Abstract

During language model decoding, higher-temperature sampling is known to produce more creative responses, while lower temperatures produce more factually accurate ones. However, such models are commonly applied to general instruction following, which involves both creative and fact-seeking tasks, using a single fixed temperature across all examples and tokens. In this work, we introduce Adaptive Decoding, a layer added to the model that selects the sampling temperature dynamically at inference time, at either the token or example level, in order to optimize performance. To learn its parameters, we introduce Latent Preference Optimization (LPO), a general approach to training discrete latent variables such as choices of temperature. Our method outperforms all fixed decoding temperatures across a range of tasks that require different temperatures, including UltraFeedback, Creative Story Writing, and GSM8K.
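
As a rough illustration of the idea of choosing a sampling temperature per token, the following is a minimal sketch and not the paper's implementation. The candidate set `TEMPERATURES`, the helper names `select_temperature` and `sample_token`, and the convention that temperature 0 means greedy decoding are all assumptions made for this example. In a real model, the selector logits would come from a learned layer on top of the model's hidden state.

```python
import math
import random

# Hypothetical discrete set of temperatures the selector can choose from
# (an assumption for illustration; the abstract does not list the values).
TEMPERATURES = [0.0, 0.4, 0.8, 1.2]


def softmax(logits, temperature=1.0):
    """Temperature-scaled softmax; lower temperature sharpens the distribution."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def select_temperature(selector_logits, rng, greedy=False):
    """Pick one temperature from TEMPERATURES given selector logits.

    The selector logits stand in for the output of a hypothetical learned
    layer; here they are passed in directly.
    """
    probs = softmax(selector_logits)
    if greedy:
        idx = max(range(len(probs)), key=probs.__getitem__)
    else:
        idx = rng.choices(range(len(probs)), weights=probs, k=1)[0]
    return TEMPERATURES[idx]


def sample_token(token_logits, temperature, rng):
    """Sample a token index; temperature 0 is treated as greedy decoding."""
    if temperature == 0.0:
        return max(range(len(token_logits)), key=token_logits.__getitem__)
    probs = softmax(token_logits, temperature)
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]


if __name__ == "__main__":
    rng = random.Random(0)
    token_logits = [0.1, 3.0, 0.2]         # toy next-token logits
    selector_logits = [2.0, 0.5, 0.1, 0.1]  # toy selector output
    temperature = select_temperature(selector_logits, rng, greedy=True)
    token = sample_token(token_logits, temperature, rng)
    print(temperature, token)
```

For example-level selection, the same `select_temperature` call would be made once per prompt and the chosen value reused for every token of the response.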
