

Leanabell-Prover: Posttraining Scaling in Formal Reasoning

April 8, 2025
Authors: Jingyuan Zhang, Qi Wang, Xingguang Ji, Yahui Liu, Yang Yue, Fuzheng Zhang, Di Zhang, Guorui Zhou, Kun Gai
cs.AI

Abstract

Recent advances in automated theorem proving (ATP) through LLMs have highlighted the potential of formal reasoning with Lean 4 code. However, ATP has not yet been revolutionized by the recent posttraining scaling demonstrated by OpenAI O1/O3 and DeepSeek R1. In this work, we investigate the entire posttraining pipeline of ATP, aiming to align it with breakthroughs in natural-language reasoning models. To begin, we continually train current ATP models on a hybrid dataset, which consists of numerous statement-proof pairs together with additional data aimed at incorporating cognitive behaviors that emulate human reasoning and hypothesis refinement. Next, we explore reinforcement learning using the outcome reward returned by the Lean 4 compiler. Through our designed continual training and reinforcement learning processes, we have successfully improved existing formal provers, including both DeepSeek-Prover-v1.5 and Goedel-Prover, achieving state-of-the-art performance in whole-proof generation. For example, we achieve a 59.8% pass rate (pass@32) on MiniF2F. This is an ongoing project, and we will progressively update our findings and release our data and training details.
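The abstract describes a binary outcome reward for reinforcement learning: a generated whole proof is rewarded only if the Lean 4 compiler accepts it. The sketch below illustrates one plausible way to compute such a reward; the command line ("lake env lean"), the project directory, and the function name are assumptions for illustration, since the paper's actual verification harness is not specified in the abstract.

import subprocess
import tempfile
from pathlib import Path

def outcome_reward(lean_source: str,
                   project_dir: str = "lean_project",  # hypothetical Lean 4 project root
                   timeout_s: int = 60) -> float:
    """Return 1.0 if the candidate Lean 4 file compiles, else 0.0.

    `lean_source` is the full candidate file: imports, the theorem
    statement, and the model-generated whole proof.
    """
    # Write the candidate proof into the project so imports (e.g. Mathlib) resolve.
    with tempfile.NamedTemporaryFile(
        mode="w", suffix=".lean", dir=project_dir, delete=False
    ) as f:
        f.write(lean_source)
        lean_file = Path(f.name)
    try:
        # Invoke the Lean 4 compiler through the project's toolchain.
        result = subprocess.run(
            ["lake", "env", "lean", lean_file.name],
            cwd=project_dir,
            capture_output=True,
            timeout=timeout_s,
        )
        # Binary outcome reward: success iff compilation (proof checking) passes.
        return 1.0 if result.returncode == 0 else 0.0
    except subprocess.TimeoutExpired:
        return 0.0  # treat compiler timeouts as failed proofs
    finally:
        lean_file.unlink(missing_ok=True)

In an RL loop, this reward would score each sampled proof attempt, and the pass@32 metric reported above corresponds to counting a statement as solved if any of 32 sampled proofs receives reward 1.0.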

