

Sample, Don't Search: Rethinking Test-Time Alignment for Language Models

April 4, 2025
作者: Gonçalo Faria, Noah A. Smith
cs.AI

Abstract

Increasing test-time computation has emerged as a promising direction for improving language model performance, particularly in scenarios where model finetuning is impractical or impossible due to computational constraints or private model weights. However, existing test-time search methods using a reward model (RM) often degrade in quality as compute scales, due to the over-optimization of what are inherently imperfect reward proxies. We introduce QAlign, a new test-time alignment approach. As we scale test-time compute, QAlign converges to sampling from the optimal aligned distribution for each individual prompt. By adopting recent advances in Markov chain Monte Carlo for text generation, our method enables better-aligned outputs without modifying the underlying model or even requiring logit access. We demonstrate the effectiveness of QAlign on mathematical reasoning benchmarks (GSM8K and GSM-Symbolic) using a task-specific RM, showing consistent improvements over existing test-time compute methods like best-of-n and majority voting. Furthermore, when applied with more realistic RMs trained on the Tulu 3 preference dataset, QAlign outperforms direct preference optimization (DPO), best-of-n, majority voting, and weighted majority voting on a diverse range of datasets (GSM8K, MATH500, IFEval, MMLU-Redux, and TruthfulQA). As a practical solution for aligning language models at test time with additional computation and without degradation, our approach expands the limits of what can be obtained from off-the-shelf language models without further training.

