
Diverse Inference and Verification for Advanced Reasoning

February 14, 2025
Authors: Iddo Drori, Gaston Longhitano, Mao Mao, Seunghwan Hyun, Yuke Zhang, Sungjun Park, Zachary Meeks, Xin-Yu Zhang, Ben Segev, Howard Yong, Nakul Verma, Avi Shporer, Alon Amit, Madeleine Udell
cs.AI

Abstract

Reasoning LLMs such as OpenAI o1, o3 and DeepSeek R1 have made significant progress in mathematics and coding, yet still struggle with advanced tasks such as International Mathematical Olympiad (IMO) combinatorics problems, Abstraction and Reasoning Corpus (ARC) puzzles, and Humanity's Last Exam (HLE) questions. We use a diverse inference approach that combines multiple models and methods at test time. We find that verifying mathematics and code problems, and applying rejection sampling to other problems, is simple and effective. We automatically verify the correctness of solutions to IMO problems with Lean and to ARC puzzles with code, and find that best-of-N effectively answers HLE questions. Our approach increases answer accuracy on IMO combinatorics problems from 33.3% to 77.8% and accuracy on HLE questions from 8% to 37%, and solves 80% of the ARC puzzles that 948 humans could not solve and 26.5% of the ARC puzzles that o3 high compute does not solve. Test-time simulations, reinforcement learning, and meta-learning with inference feedback improve generalization by adapting agent graph representations and varying prompts, code, and datasets. Our approach is reliable, robust, and scalable, and in the spirit of reproducible research, we will make it publicly available upon publication.
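
The two selection ideas named in the abstract, verification by code and best-of-N, can be illustrated with a toy sketch. This is not the authors' implementation: the grid transformations, the `first_verified` and `best_of_n` helpers, and the scores are all hypothetical stand-ins chosen for illustration. In a real system the candidates would presumably be model-generated programs or answers, and the scorer would be some learned or rule-based judge.

```python
from typing import Callable, List, Optional, Sequence, Tuple

Grid = List[List[int]]
Program = Callable[[Grid], Grid]


# Hypothetical candidate "programs" standing in for model-generated code.
def identity(grid: Grid) -> Grid:
    return [list(row) for row in grid]


def transpose(grid: Grid) -> Grid:
    return [list(col) for col in zip(*grid)]


def flip_horizontal(grid: Grid) -> Grid:
    return [list(reversed(row)) for row in grid]


def first_verified(
    candidates: Sequence[Program],
    train_pairs: Sequence[Tuple[Grid, Grid]],
) -> Optional[Program]:
    """Verification by code: keep the first candidate that reproduces
    every training output exactly; return None if none does."""
    for program in candidates:
        if all(program(x) == y for x, y in train_pairs):
            return program
    return None


def best_of_n(answers: Sequence[str], score: Callable[[str], float]) -> str:
    """Best-of-N: return the highest-scoring answer among N samples.
    The scoring function is an assumption supplied by the caller."""
    return max(answers, key=score)


# Toy ARC-style task: the hidden rule mirrors each row.
train_pairs = [([[1, 2], [3, 4]], [[2, 1], [4, 3]])]
solver = first_verified([identity, transpose, flip_horizontal], train_pairs)
prediction = solver([[5, 6, 7]]) if solver else None

# Toy best-of-N over three sampled answers with made-up scores.
toy_scores = {"A": 0.2, "B": 0.9, "C": 0.5}
chosen = best_of_n(list(toy_scores), lambda a: toy_scores[a])
```

The design difference is that verification gives a hard accept or reject signal when a checker exists (training pairs for ARC, a proof checker such as Lean for mathematics), while best-of-N only ranks samples and therefore depends on the quality of the scorer.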
