Diving into Self-Evolving Training for Multimodal Reasoning

Abstract

Reasoning ability is essential for Large Multimodal Models (LMMs). In the absence of multimodal chain-of-thought annotated data, self-evolving training, where the model learns from its own outputs, has emerged as an effective and scalable approach for enhancing reasoning abilities. Despite its growing usage, a comprehensive understanding of self-evolving training, particularly in the context of multimodal reasoning, remains limited. In this paper, we delve into the intricacies of self-evolving training for multimodal reasoning, pinpointing three key factors: Training Method, Reward Model, and Prompt Variation. We systematically examine each factor and explore how various configurations affect the training's effectiveness. Our analysis leads to a set of best practices for each factor, aimed at optimizing multimodal reasoning. Furthermore, we explore the Self-Evolution Dynamics during training and the impact of automatic balancing mechanisms in boosting performance. Building on these investigations, we present a final recipe for self-evolving training in multimodal reasoning, encapsulating these design choices into a framework we call MSTaR (Multimodal Self-evolving Training for Reasoning). The framework is universally effective for models of different sizes on various benchmarks: without using additional human annotations, it significantly surpasses the pre-evolved model on 5 multimodal reasoning benchmarks, as demonstrated on MiniCPM-V-2.5 (8B), Phi-3.5-Vision (4B), and InternVL2 (2B). We believe this study fills a significant gap in the understanding of self-evolving training for multimodal reasoning and offers a robust framework for future research. Our policy and reward models, as well as the collected data, are released to facilitate further investigation in multimodal reasoning.
