URSA: Understanding and Verifying Chain-of-thought Reasoning in Multimodal Mathematics

January 8, 2025
Authors: Ruilin Luo, Zhuofan Zheng, Yifan Wang, Yiyao Yu, Xinzhe Ni, Zicheng Lin, Jin Zeng, Yujiu Yang
cs.AI

Abstract

Chain-of-thought (CoT) reasoning has been widely applied in the mathematical reasoning of Large Language Models (LLMs). Recently, the introduction of derivative process supervision on CoT trajectories has sparked discussions on enhancing scaling capabilities during test time, thereby boosting the potential of these models. However, in multimodal mathematical reasoning, the scarcity of high-quality CoT training data has hindered existing models from achieving high-precision CoT reasoning and has limited the realization of reasoning potential during test time. In this work, we propose a three-module synthesis strategy that integrates CoT distillation, trajectory-format rewriting, and format unification. It results in MMathCoT-1M, a high-quality CoT reasoning instruction fine-tuning dataset for multimodal mathematics. We comprehensively validate the state-of-the-art (SOTA) performance of the trained URSA-7B model on multiple multimodal mathematical benchmarks. For test-time scaling, we introduce a data synthesis strategy that automatically generates process annotation datasets, termed DualMath-1.1M, focusing on both interpretation and logic. By further training URSA-7B on DualMath-1.1M, we transition from CoT reasoning capabilities to robust supervision abilities. The trained URSA-RM-7B acts as a verifier, effectively enhancing the performance of URSA-7B at test time. URSA-RM-7B also demonstrates excellent out-of-distribution (OOD) verification capabilities, showcasing its generalization. Model weights, training data, and code will be open-sourced.
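The verifier-at-test-time setup the abstract describes is commonly realized as best-of-N reranking: the policy model samples several CoT candidates and a process reward model picks the one it scores highest. Below is a minimal sketch of that pattern, assuming hypothetical `generate` and `score_steps` callables standing in for URSA-7B and URSA-RM-7B; the function names, signatures, and min-aggregation choice are illustrative assumptions, not the paper's actual API.

```python
# Minimal best-of-N test-time verification sketch (illustrative only).
from typing import Callable, List

def best_of_n(
    problem: str,
    image: bytes,
    generate: Callable[[str, bytes], str],                  # samples one CoT solution (e.g., URSA-7B)
    score_steps: Callable[[str, bytes, str], List[float]],  # per-step verifier scores (e.g., URSA-RM-7B)
    n: int = 8,
) -> str:
    """Sample n candidate solutions and return the one the verifier prefers."""
    candidates = [generate(problem, image) for _ in range(n)]

    def solution_score(solution: str) -> float:
        # Aggregate per-step scores with min(): any single faulty step
        # sinks the whole trajectory, a common choice for process reward models.
        steps = score_steps(problem, image, solution)
        return min(steps) if steps else float("-inf")

    return max(candidates, key=solution_score)
```

Min-aggregation is one of several plausible choices here; averaging or taking the final-step score are alternatives, and the paper's exact scoring rule may differ.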
