
TrustGeoGen: Scalable and Formal-Verified Data Engine for Trustworthy Multi-modal Geometric Problem Solving

April 22, 2025
Authors: Daocheng Fu, Zijun Chen, Renqiu Xia, Qi Liu, Yuan Feng, Hongbin Zhou, Renrui Zhang, Shiyang Feng, Peng Gao, Junchi Yan, Botian Shi, Bo Zhang, Yu Qiao
cs.AI

Abstract

Mathematical geometric problem solving (GPS) often requires effective integration of multimodal information and verifiable logical coherence. Despite the rapid development of large language models in general problem solving, open questions remain in both methodology and benchmarks, especially because existing synthetic GPS benchmarks are often not self-verified and contain noise and self-contradictory information caused by LLM hallucination. In this paper, we propose a scalable data engine called TrustGeoGen for problem generation, with formal verification to provide a principled benchmark, which we believe lays the foundation for the further development of GPS methods. The engine synthesizes geometric data through four key innovations: 1) multimodal-aligned generation of diagrams, textual descriptions, and stepwise solutions; 2) formal verification ensuring rule-compliant reasoning paths; 3) a bootstrapping mechanism enabling complexity escalation via recursive state generation; and 4) our GeoExplore series of algorithms, which simultaneously produce multi-solution variants and self-reflective backtracking traces. Through formal logical verification, TrustGeoGen produces the GeoTrust-200K dataset with guaranteed modality integrity, along with the GeoTrust-test test set. Experiments show that state-of-the-art models achieve only 49.17% accuracy on GeoTrust-test, demonstrating its evaluation stringency. Crucially, models trained on GeoTrust achieve OOD generalization on GeoQA, significantly reducing logical inconsistencies relative to pseudo-labels annotated by OpenAI-o1. Our code is available at https://github.com/Alpha-Innovator/TrustGeoGen
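
To make the idea of "formal verification ensuring rule-compliant reasoning paths" concrete, here is a minimal toy sketch in Python. It is not the TrustGeoGen implementation: the `Rule`, `Step`, and `verify_path` names, the string encoding of facts, and the example rules are all assumptions made for illustration. The sketch only shows the general principle that a solution is accepted when every step applies a known rule to facts that are already established.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Rule:
    """A ground inference rule (hypothetical encoding): premises imply a conclusion."""
    name: str
    premises: frozenset
    conclusion: str


@dataclass(frozen=True)
class Step:
    """One step of a candidate solution, citing the rule it claims to use."""
    rule_name: str
    premises: tuple
    conclusion: str


def verify_path(given, rules, steps, goal):
    """Return True only if each step is a valid rule application and the goal is derived."""
    known = set(given)
    by_name = {rule.name: rule for rule in rules}
    for step in steps:
        rule = by_name.get(step.rule_name)
        if rule is None:
            return False  # the step cites a rule that does not exist
        if frozenset(step.premises) != rule.premises:
            return False  # the step does not use the rule's premises
        if step.conclusion != rule.conclusion:
            return False  # the step's conclusion is not what the rule yields
        if not set(step.premises) <= known:
            return False  # the step relies on a fact that was never established
        known.add(step.conclusion)
    return goal in known


# Toy problem (assumed for illustration): an isosceles triangle with one known base angle.
GIVEN = {"AB = AC", "angle ABC = 50"}
RULES = [
    Rule("isosceles_base_angles", frozenset({"AB = AC"}), "angle ABC = angle ACB"),
    Rule(
        "substitute_equal_angle",
        frozenset({"angle ABC = angle ACB", "angle ABC = 50"}),
        "angle ACB = 50",
    ),
]
GOAL = "angle ACB = 50"

valid_steps = [
    Step("isosceles_base_angles", ("AB = AC",), "angle ABC = angle ACB"),
    Step(
        "substitute_equal_angle",
        ("angle ABC = angle ACB", "angle ABC = 50"),
        "angle ACB = 50",
    ),
]

# This path skips the first step, so it uses an angle equality that was never derived.
unsupported_steps = [valid_steps[1]]

valid_result = verify_path(GIVEN, RULES, valid_steps, GOAL)
unsupported_result = verify_path(GIVEN, RULES, unsupported_steps, GOAL)
```

In this toy setting the complete path is accepted and the path with the missing step is rejected, which is the kind of self-contradictory or unsupported reasoning the abstract says unverified synthetic benchmarks can contain. A real geometric verifier would work with parameterized rules and diagram-level facts rather than fixed strings.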

