URSA: Understanding and Verifying Chain-of-thought Reasoning in Multimodal Mathematics

Abstract

Chain-of-thought (CoT) reasoning has been widely applied to the mathematical reasoning of Large Language Models (LLMs). Recently, the introduction of derivative process supervision on CoT trajectories has sparked discussion about enhancing test-time scaling, which in turn raises the potential of these models. In multimodal mathematical reasoning, however, the scarcity of high-quality CoT training data has kept existing models from achieving high-precision CoT reasoning and has limited how much of that potential can be realized at test time. In this work, we propose a three-module synthesis strategy that integrates CoT distillation, trajectory-format rewriting, and format unification. It yields MMathCoT-1M, a high-quality CoT reasoning instruction fine-tuning dataset for multimodal mathematics. We comprehensively validate the state-of-the-art (SOTA) performance of the trained URSA-7B model on multiple multimodal mathematical benchmarks. For test-time scaling, we introduce a data synthesis strategy that automatically generates a process annotation dataset, DualMath-1.1M, focusing on both interpretation and logic. By further training URSA-7B on DualMath-1.1M, we move from CoT reasoning capabilities to robust supervision abilities. The trained URSA-RM-7B acts as a verifier and effectively improves the performance of URSA-7B at test time. URSA-RM-7B also demonstrates excellent out-of-distribution (OOD) verification capabilities, which indicates that it generalizes. Model weights, training data, and code will be open-sourced.

