

Test-Time Preference Optimization: On-the-Fly Alignment via Iterative Textual Feedback

January 22, 2025
Authors: Yafu Li, Xuyang Hu, Xiaoye Qu, Linjie Li, Yu Cheng
cs.AI

Abstract

Large language models (LLMs) demonstrate impressive performance but lack the flexibility to adapt to human preferences quickly without retraining. In this work, we introduce Test-time Preference Optimization (TPO), a framework that aligns LLM outputs with human preferences during inference, removing the need to update model parameters. Rather than relying on purely numerical rewards, TPO translates reward signals into textual critiques and uses them as textual rewards to iteratively refine its response. Evaluations on benchmarks covering instruction following, preference alignment, safety, and mathematics reveal that TPO progressively improves alignment with human preferences. Notably, after only a few TPO steps, the initially unaligned Llama-3.1-70B-SFT model can surpass its aligned counterpart, Llama-3.1-70B-Instruct. Furthermore, TPO scales efficiently with both the search width and depth during inference. Through case studies, we illustrate how TPO exploits the innate capacity of LLMs to interpret and act upon reward signals. Our findings establish TPO as a practical, lightweight alternative for test-time preference optimization, achieving alignment on the fly. Our code is publicly available at https://github.com/yafuly/TPO.
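The abstract describes TPO's loop only at a high level. Below is a minimal, hypothetical Python sketch of such a test-time refinement loop, assuming placeholder callables (generate, score, critique) stand in for the policy model, reward model, and critique prompting; it illustrates the idea stated in the abstract (sample candidates, convert the best/worst contrast into a textual critique, regenerate conditioned on that feedback) and is not the authors' implementation.

```python
# Hypothetical sketch of a TPO-style test-time loop (not the authors' code).
# Width = number of candidates sampled per iteration; depth = number of iterations.
from typing import Callable, List, Tuple


def tpo_loop(
    prompt: str,
    generate: Callable[[str, int], List[str]],   # prompt, width -> candidate responses
    score: Callable[[str, str], float],          # prompt, response -> scalar reward
    critique: Callable[[str, str, str], str],    # prompt, best, worst -> textual feedback
    width: int = 4,
    depth: int = 3,
) -> str:
    """Iteratively refine a response at inference time using textual rewards,
    without updating any model parameters."""
    best_response = ""
    feedback = ""
    for _ in range(depth):
        # Sample candidates, conditioning on the latest textual feedback if any.
        query = prompt if not feedback else f"{prompt}\n\nFeedback on previous attempts: {feedback}"
        candidates = generate(query, width)

        # Rank candidates with the scalar reward model.
        scored: List[Tuple[float, str]] = sorted(
            ((score(prompt, c), c) for c in candidates), reverse=True
        )
        best_response, worst_response = scored[0][1], scored[-1][1]

        # Turn the numerical preference (best vs. worst) into a textual critique,
        # which serves as the "textual reward" guiding the next round of generation.
        feedback = critique(prompt, best_response, worst_response)
    return best_response
```

The design choice highlighted by the abstract is that the preference signal reaches the model as text rather than as a gradient update, so alignment happens on the fly while all parameters stay frozen.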
