
Iterative Value Function Optimization for Guided Decoding

March 4, 2025
Authors: Zhenhua Liu, Lijun Li, Ruizhe Chen, Yuxian Jiang, Tong Zhu, Wenliang Chen, Jing Shao
cs.AI

Abstract

While Reinforcement Learning from Human Feedback (RLHF) has become the predominant method for controlling language model outputs, it suffers from high computational costs and training instability. Guided decoding, especially value-guided methods, offers a cost-effective alternative by controlling outputs without re-training models. However, the accuracy of the value function is crucial for value-guided decoding, as inaccuracies can lead to suboptimal decision-making and degraded performance. Existing methods struggle with accurately estimating the optimal value function, leading to less effective control. We propose Iterative Value Function Optimization, a novel framework that addresses these limitations through two key components: Monte Carlo Value Estimation, which reduces estimation variance by exploring diverse trajectories, and Iterative On-Policy Optimization, which progressively improves value estimation through collecting trajectories from value-guided policies. Extensive experiments on text summarization, multi-turn dialogue, and instruction following demonstrate the effectiveness of value-guided decoding approaches in aligning language models. These approaches not only achieve alignment but also significantly reduce computational costs by leveraging principled value function optimization for efficient and effective control.
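To make the two components concrete, below is a minimal, self-contained Python sketch of value-guided decoding with Monte Carlo value estimation. It is written against hypothetical interfaces: base_policy, value_fn, reward, rollout_policy, and the weighting parameter beta are illustrative assumptions, not the paper's implementation. In the full framework, the value function would then be refit on Monte Carlo targets computed from trajectories sampled with the current value-guided policy, and the loop would repeat (the iterative on-policy step).

```python
import math
import random

def value_guided_step(prefix, base_policy, value_fn, beta=1.0, top_k=20):
    """One decoding step: re-rank the base policy's top-k candidate tokens by
    combining each token's log-probability with the value estimate of the
    extended prefix, then sample from the re-weighted distribution."""
    candidates = sorted(base_policy(prefix).items(),
                        key=lambda kv: kv[1], reverse=True)[:top_k]
    scored = [(tok, math.log(p) + beta * value_fn(prefix + [tok]))
              for tok, p in candidates]
    # Softmax sampling over the combined scores.
    m = max(s for _, s in scored)
    weights = [math.exp(s - m) for _, s in scored]
    r = random.random() * sum(weights)
    for (tok, _), w in zip(scored, weights):
        r -= w
        if r <= 0:
            return tok
    return scored[-1][0]

def monte_carlo_value_target(prefix, rollout_policy, reward, n_rollouts=8):
    """Monte Carlo value estimation: average the terminal rewards of several
    complete rollouts from the prefix, lowering the variance of the target
    used to fit the value function."""
    return sum(reward(rollout_policy(prefix)) for _ in range(n_rollouts)) / n_rollouts

if __name__ == "__main__":
    # Toy demo over a three-token vocabulary; every component is a stand-in.
    vocab = ["good", "bad", "end"]

    def base_policy(prefix):                 # uniform "language model"
        return {tok: 1.0 / len(vocab) for tok in vocab}

    def rollout_policy(prefix):              # random continuation of length 3
        return prefix + [random.choice(vocab) for _ in range(3)]

    def reward(seq):                         # toy sequence-level reward
        return seq.count("good") - seq.count("bad")

    def value_fn(prefix):                    # MC estimate used as the value function
        return monte_carlo_value_target(prefix, rollout_policy, reward)

    print(value_guided_step([], base_policy, value_fn, beta=2.0, top_k=3))
```

Because the base model is never re-trained, the only added cost in this sketch is the top-k value evaluations at each decoding step, which is the source of the computational savings the abstract describes relative to RLHF.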
