Heimdall: test-time scaling on the generative verification

April 14, 2025
Authors: Wenlei Shi, Xing Jin
cs.AI

Abstract

An AI system can create and maintain knowledge only to the extent that it can verify that knowledge itself. Recent work on long Chain-of-Thought (CoT) reasoning has demonstrated the great potential of LLMs for solving competition problems, but their verification ability remains weak and insufficiently studied. In this paper, we propose Heimdall, a long-CoT verification LLM that can accurately judge the correctness of solutions. With pure reinforcement learning, we boost verification accuracy on competition math problems from 62.5% to 94.5%. Scaling with repeated sampling raises the accuracy further, to 97.5%. In human evaluation, Heimdall demonstrates impressive generalization, successfully detecting most issues in challenging math proofs, a problem type that was not included in training. Furthermore, we propose Pessimistic Verification, which extends Heimdall to scale up problem solving. It calls Heimdall to judge the solutions produced by a solver model and, following the pessimistic principle, selects the solution that is most likely correct with the least uncertainty. With DeepSeek-R1-Distill-Qwen-32B as the solver model, Pessimistic Verification improves solution accuracy on AIME2025 from 54.2% to 70.0% with a 16x compute budget, and to 83.3% with a larger compute budget. With the stronger solver Gemini 2.5 Pro, the score reaches 93.0%. Finally, we prototype an automatic knowledge discovery system, a ternary system in which one component poses questions, another provides solutions, and a third verifies the solutions. Using the data synthesis work NuminaMath for the first two components, Heimdall effectively identifies problematic records in the dataset and reveals that nearly half of the data is flawed, which interestingly aligns with recent ablation studies from NuminaMath.
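
The abstract does not give the formulas behind repeated sampling or Pessimistic Verification, so the following is only a minimal sketch of the general idea, not the authors' implementation. It assumes each verifier call returns a boolean verdict, aggregates repeated verdicts by majority vote, and scores each candidate answer by its mean verdict minus an uncertainty penalty that shrinks as more verdicts are collected. The function names, the `penalty` parameter, and the `penalty / sqrt(n)` form are all hypothetical choices made for illustration.

```python
import math


def majority_verdict(votes):
    """Aggregate repeated verifier samples (True = judged correct).

    A tie counts as "not correct", which is the pessimistic reading.
    """
    return 2 * sum(votes) > len(votes)


def pessimistic_select(verdicts, penalty=1.0):
    """Return the answer with the highest pessimistic (lower-bound) score.

    `verdicts` maps each candidate answer to the list of boolean verifier
    verdicts collected over all solutions that reached that answer.
    The score is mean verdict minus penalty / sqrt(n); this penalty form
    is an assumption for illustration, not the formula from the paper.
    """
    best_answer, best_score = None, -math.inf
    for answer, votes in verdicts.items():
        if not votes:
            continue  # no evidence at all: never selected
        mean = sum(votes) / len(votes)
        score = mean - penalty / math.sqrt(len(votes))
        if score > best_score:
            best_answer, best_score = answer, score
    return best_answer


# Hypothetical verdicts: "42" looks perfect but rests on only 2 checks,
# while "17" has a slightly lower pass rate backed by 16 checks.
example = {
    "42": [True, True],
    "17": [True] * 14 + [False] * 2,
}
chosen = pessimistic_select(example)
```

Under these assumptions the well-supported answer wins over the thinly supported one, which is the behavior the phrase "most likely correct with the least uncertainty" suggests. Setting `penalty=0` reduces the rule to picking the highest mean verdict.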
