Sample, Don't Search: Rethinking Test-Time Alignment for Language Models
April 4, 2025
Authors: Gonçalo Faria, Noah A. Smith
cs.AI
Abstract
Increasing test-time computation has emerged as a promising direction for
improving language model performance, particularly in scenarios where model
finetuning is impractical or impossible due to computational constraints or
private model weights. However, existing test-time search methods using a
reward model (RM) often degrade in quality as compute scales, due to the
over-optimization of what are inherently imperfect reward proxies. We introduce
QAlign, a new test-time alignment approach. As we scale test-time compute,
QAlign converges to sampling from the optimal aligned distribution for each
individual prompt. By adopting recent advances in Markov chain Monte Carlo for
text generation, our method enables better-aligned outputs without modifying
the underlying model or even requiring logit access. We demonstrate the
effectiveness of QAlign on mathematical reasoning benchmarks (GSM8K and
GSM-Symbolic) using a task-specific RM, showing consistent improvements over
existing test-time compute methods like best-of-n and majority voting.
Furthermore, when applied with more realistic RMs trained on the Tulu 3
preference dataset, QAlign outperforms direct preference optimization (DPO),
best-of-n, majority voting, and weighted majority voting on a diverse range of
datasets (GSM8K, MATH500, IFEval, MMLU-Redux, and TruthfulQA). A practical
solution to aligning language models at test time using additional computation
without degradation, our approach expands the limits of the capability that can
be obtained from off-the-shelf language models without further training.
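The abstract does not spell out the sampler, but the following minimal Python sketch illustrates the general idea of MCMC-based test-time alignment under stated assumptions: an independence Metropolis-Hastings chain targeting the reward-tilted distribution pi(y|x) * exp(r(x, y) / beta), with the base model itself as the proposal. Because the proposal equals the base model, its probabilities cancel in the acceptance ratio, which is consistent with the claim that no logit access is needed. The names generate, reward, beta, and num_steps are illustrative placeholders; QAlign's actual proposal and acceptance rule may differ.

    import math
    import random

    def mcmc_align(prompt, generate, reward, beta=1.0, num_steps=100):
        """Illustrative (not the paper's exact) sampler for test-time alignment.

        Targets pi(y|x) * exp(r(x, y) / beta) via independence Metropolis-Hastings,
        proposing from the base model. `generate(prompt)` draws a full completion
        and `reward(prompt, completion)` scores it; both are treated as black boxes,
        so neither model weights nor logits are required.
        """
        current = generate(prompt)
        current_r = reward(prompt, current)
        samples = []
        for _ in range(num_steps):
            candidate = generate(prompt)            # propose from the base model
            candidate_r = reward(prompt, candidate)
            # Acceptance probability min(1, exp((r' - r) / beta)):
            # higher-reward candidates are always accepted, lower-reward ones sometimes.
            accept_prob = math.exp(min(0.0, (candidate_r - current_r) / beta))
            if random.random() < accept_prob:
                current, current_r = candidate, candidate_r
            samples.append(current)
        return samples  # approximate draws from the aligned distribution

In practice one would likely aggregate the chain's samples (for example, by majority vote over extracted final answers) rather than returning a single draw; the choice of beta trades off reward optimization against staying close to the base model.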