Leanabell-Prover: Posttraining Scaling in Formal Reasoning
April 8, 2025
Authors: Jingyuan Zhang, Qi Wang, Xingguang Ji, Yahui Liu, Yang Yue, Fuzheng Zhang, Di Zhang, Guorui Zhou, Kun Gai
cs.AI
Abstract
Recent advances in automated theorem proving (ATP) through LLMs have
highlighted the potential of formal reasoning with Lean 4 code. However, ATP
has not yet been revolutionized by the recent posttraining scaling
demonstrated by OpenAI o1/o3 and DeepSeek R1. In this work, we investigate the
entire posttraining pipeline of ATP, aiming to align it with breakthroughs in
natural-language reasoning models. To begin, we continually train current ATP
models on a hybrid dataset, which consists of numerous statement-proof pairs
together with additional data designed to instill cognitive behaviors that
emulate human reasoning and hypothesis refinement. Next, we explore
reinforcement learning using the outcome reward returned by the Lean 4
compiler. Through our designed continual training and reinforcement learning
processes, we have successfully improved existing formal provers, including
both DeepSeek-Prover-v1.5 and Goedel-Prover, achieving state-of-the-art
performance in the field of whole-proof generation. For example, we achieve a
59.8% pass rate (pass@32) on MiniF2F. This is an ongoing project, and we will
progressively update our findings and release our data and training details.
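To make the data format concrete: a statement-proof pair, the unit of both the hybrid training set and of whole-proof generation, couples a formal Lean 4 statement with a complete proof script. The toy theorem below is our own illustration, not an item from the paper's dataset:

```lean
-- A toy statement-proof pair (an illustration, not from the paper's data).
-- The "statement" is the theorem header; the "proof" is the full tactic
-- block after `:= by`, which a whole-proof model must emit in one pass.
theorem add_comm_example (a b : Nat) : a + b = b + a := by
  exact Nat.add_comm a b
```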
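The outcome reward mentioned in the abstract can be pictured as a binary signal from compiler verification. The sketch below is a hypothetical minimal harness, not the paper's actual verification setup (function name and file layout are ours; real pipelines typically run inside a Lake project so Mathlib is available):

```python
import subprocess
import tempfile
from pathlib import Path

def lean_outcome_reward(lean_source: str, timeout_s: int = 60) -> float:
    """Return a binary outcome reward for a candidate whole proof.

    `lean_source` is a self-contained Lean 4 file containing the formal
    statement and the model-generated proof. The reward is 1.0 if the
    Lean 4 compiler accepts the file and 0.0 otherwise (including on
    timeout). This is a minimal sketch of the outcome-reward idea, not
    the paper's verification harness.
    """
    with tempfile.TemporaryDirectory() as tmp:
        path = Path(tmp) / "Candidate.lean"
        path.write_text(lean_source)
        try:
            # The Lean compiler exits with a nonzero status on any
            # elaboration error, so exit code 0 means the proof checks.
            result = subprocess.run(
                ["lean", str(path)],
                capture_output=True,
                timeout=timeout_s,
            )
        except subprocess.TimeoutExpired:
            return 0.0
    return 1.0 if result.returncode == 0 else 0.0
```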
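For the reported pass@32 metric, pass@k is conventionally computed with the unbiased estimator of Chen et al. (2021); whether the paper uses this estimator or simply draws exactly 32 samples per problem is a detail beyond the abstract. A minimal sketch (the function name is ours):

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator (Chen et al., 2021): the probability
    that at least one of k samples, drawn without replacement from n
    generated proofs of which c are correct, passes the compiler check."""
    if n - c < k:
        return 1.0  # every size-k subset must contain a correct sample
    return 1.0 - comb(n - c, k) / comb(n, k)

# With exactly n = k = 32 samples per problem, pass@32 reduces to asking
# whether any of the 32 candidate proofs compiles (c >= 1).
```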