Sample, Don't Search: Rethinking Test-Time Alignment for Language Models
April 4, 2025
Authors: Gonçalo Faria, Noah A. Smith
cs.AI
Abstract
Increasing test-time computation has emerged as a promising direction for
improving language model performance, particularly in scenarios where model
finetuning is impractical or impossible due to computational constraints or
private model weights. However, existing test-time search methods using a
reward model (RM) often degrade in quality as compute scales, due to the
over-optimization of what are inherently imperfect reward proxies. We introduce
QAlign, a new test-time alignment approach. As we scale test-time compute,
QAlign converges to sampling from the optimal aligned distribution for each
individual prompt. By adopting recent advances in Markov chain Monte Carlo for
text generation, our method enables better-aligned outputs without modifying
the underlying model or even requiring logit access. We demonstrate the
effectiveness of QAlign on mathematical reasoning benchmarks (GSM8K and
GSM-Symbolic) using a task-specific RM, showing consistent improvements over
existing test-time compute methods like best-of-n and majority voting.
Furthermore, when applied with more realistic RMs trained on the Tulu 3
preference dataset, QAlign outperforms direct preference optimization (DPO),
best-of-n, majority voting, and weighted majority voting on a diverse range of
datasets (GSM8K, MATH500, IFEval, MMLU-Redux, and TruthfulQA). A practical
solution to aligning language models at test time using additional computation
without degradation, our approach expands the limits of the capability that can
be obtained from off-the-shelf language models without further training.
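The abstract does not spell out the sampler, but the following minimal Python sketch illustrates the general idea of MCMC-based test-time alignment under stated assumptions: an independence Metropolis-Hastings chain targeting the reward-tilted distribution pi(y|x) * exp(r(x, y) / beta), with the base model itself as the proposal. Because the proposal equals the base model, its probabilities cancel in the acceptance ratio, which is consistent with the claim that no logit access is needed. The names generate, reward, beta, and num_steps are illustrative placeholders; QAlign's actual proposal and acceptance rule may differ.

    import math
    import random

    def mcmc_align(prompt, generate, reward, beta=1.0, num_steps=100):
        """Illustrative (not the paper's exact) sampler for test-time alignment.

        Targets pi(y|x) * exp(r(x, y) / beta) via independence Metropolis-Hastings,
        proposing from the base model. `generate(prompt)` draws a full completion
        and `reward(prompt, completion)` scores it; both are treated as black boxes,
        so neither model weights nor logits are required.
        """
        current = generate(prompt)
        current_r = reward(prompt, current)
        samples = []
        for _ in range(num_steps):
            candidate = generate(prompt)            # propose from the base model
            candidate_r = reward(prompt, candidate)
            # Acceptance probability min(1, exp((r' - r) / beta)):
            # higher-reward candidates are always accepted, lower-reward ones sometimes.
            accept_prob = math.exp(min(0.0, (candidate_r - current_r) / beta))
            if random.random() < accept_prob:
                current, current_r = candidate, candidate_r
            samples.append(current)
        return samples  # approximate draws from the aligned distribution

In practice one would likely aggregate the chain's samples (for example, by majority vote over extracted final answers) rather than returning a single draw; the choice of beta trades off reward optimization against staying close to the base model.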