Heimdall: test-time scaling on the generative verification

April 14, 2025
Authors: Wenlei Shi, Xing Jin
cs.AI

Abstract

An AI system can create and maintain knowledge only to the extent that it can verify that knowledge itself. Recent work on long Chain-of-Thought (CoT) reasoning has demonstrated the great potential of LLMs for solving competitive problems, but their verification ability remains weak and insufficiently investigated. In this paper, we propose Heimdall, a long CoT verification LLM that can accurately judge the correctness of solutions. With pure reinforcement learning, we boost verification accuracy from 62.5% to 94.5% on competitive math problems. By scaling with repeated sampling, the accuracy further increases to 97.5%. In human evaluation, Heimdall demonstrates impressive generalization, successfully detecting most issues in challenging math proofs, a problem type not included in training. Furthermore, we propose Pessimistic Verification, which extends Heimdall to scaling up problem solving. It calls Heimdall to judge the solutions from a solver model and, based on the pessimistic principle, selects the solution that is most likely correct with the least uncertainty. Taking DeepSeek-R1-Distill-Qwen-32B as the solver model, Pessimistic Verification improves the solution accuracy on AIME2025 from 54.2% to 70.0% with a 16x compute budget and to 83.3% with a larger compute budget. With the stronger solver Gemini 2.5 Pro, the score reaches 93.0%. Finally, we prototype an automatic knowledge discovery system, a ternary system in which one component poses questions, another provides solutions, and a third verifies the solutions. Using the data synthesis work NuminaMath for the first two components, Heimdall effectively identifies problematic records within the dataset and reveals that nearly half of the data is flawed, which interestingly aligns with recent ablation studies from NuminaMath.
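
The abstract describes Pessimistic Verification only in words, so the following is a rough illustrative sketch rather than the authors' algorithm. It assumes that each solver output comes with a list of binary verifier verdicts (1 = judged correct), that solutions are grouped by final answer, and that "least uncertainty" can be modelled by subtracting a penalty that shrinks as an answer accumulates more verdicts. The function name `pessimistic_select`, the `penalty` parameter, and the square-root form of the penalty are all hypothetical choices made for illustration.

```python
import math
from collections import defaultdict


def pessimistic_select(candidates, penalty=1.0):
    """Pick an answer by mean verifier score minus an uncertainty penalty.

    candidates: list of (answer, verdicts) pairs, where verdicts is a list
    of 0/1 judgments from repeated verifier samples (assumed input format).
    penalty: hypothetical weight on the uncertainty term; 0 disables it.
    """
    # Pool all verdicts for solutions that reach the same final answer.
    grouped = defaultdict(list)
    for answer, verdicts in candidates:
        grouped[answer].extend(verdicts)

    total = sum(len(v) for v in grouped.values())
    best_answer, best_score = None, -math.inf
    for answer, verdicts in grouped.items():
        if not verdicts:
            continue  # no evidence for this answer, skip it
        mean = sum(verdicts) / len(verdicts)
        # Assumed penalty: large when few verdicts back this answer.
        uncertainty = math.sqrt(math.log(total + 1) / len(verdicts))
        score = mean - penalty * uncertainty
        if score > best_score:
            best_answer, best_score = answer, score
    return best_answer


# Example: "42" has many mostly-positive verdicts, "7" has a single one.
samples = [("42", [1, 1, 1, 1]), ("42", [1, 1, 1, 0]), ("7", [1])]
chosen = pessimistic_select(samples)
```

Under these assumptions, an answer backed by a single positive verdict loses to an answer with a slightly lower mean score but far more supporting evidence, which is the intuition behind selecting the solution that is most likely correct with the least uncertainty. Setting `penalty=0.0` reduces the rule to picking the highest mean verifier score.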
