Diving into Self-Evolving Training for Multimodal Reasoning
December 23, 2024
Authors: Wei Liu, Junlong Li, Xiwen Zhang, Fan Zhou, Yu Cheng, Junxian He
cs.AI
Abstract
Reasoning ability is essential for Large Multimodal Models (LMMs). In the
absence of multimodal chain-of-thought annotated data, self-evolving training,
where the model learns from its own outputs, has emerged as an effective and
scalable approach for enhancing reasoning abilities. Despite its growing usage,
a comprehensive understanding of self-evolving training, particularly in the
context of multimodal reasoning, remains limited. In this paper, we delve into
the intricacies of self-evolving training for multimodal reasoning, pinpointing
three key factors: Training Method, Reward Model, and Prompt Variation. We
systematically examine each factor and explore how various configurations
affect the training's effectiveness. Our analysis leads to a set of best
practices for each factor, aimed at optimizing multimodal reasoning.
Furthermore, we explore the Self-Evolution Dynamics during training and the
impact of automatic balancing mechanisms in boosting performance. After all the
investigations, we present a final recipe for self-evolving training in
multimodal reasoning, encapsulating these design choices into a framework we
call MSTaR (Multimodal Self-evolving Training for Reasoning), which is
universally effective for models of different sizes across various benchmarks,
e.g., surpassing the pre-evolved model significantly on 5 multimodal reasoning
benchmarks without using additional human annotations, as demonstrated on
MiniCPM-V-2.5 (8B), Phi-3.5-Vision (4B) and InternVL2 (2B). We believe this
study fills a significant gap in the understanding of self-evolving training
for multimodal reasoning and offers a robust framework for future research. Our
policy and reward models, as well as the collected data, are released to
facilitate further investigation in multimodal reasoning.
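To make the procedure described in the abstract more concrete, below is a minimal sketch of a generic self-evolving training loop, assuming the usual setup in which the policy generates its own chain-of-thought responses, a reward model filters them, and the policy is fine-tuned on the surviving samples. This is not the authors' released MSTaR implementation; the names `generate_candidates`, `score`, `finetune`, and `vary_prompts` are hypothetical placeholders used only to illustrate the three factors the paper studies (training method, reward model, prompt variation).

```python
def vary_prompts(prompts):
    # Placeholder for the "prompt variation" factor: in practice this could
    # expand, reweight, or resample the multimodal prompt pool per iteration.
    return prompts


def self_evolving_training(policy, reward_model, prompts,
                           num_iterations=3, samples_per_prompt=8,
                           reward_threshold=0.5):
    """Sketch of a self-evolving training loop for multimodal reasoning."""
    for _ in range(num_iterations):
        training_data = []
        for prompt in prompts:  # e.g., (image, question) pairs
            # 1. The policy samples its own chain-of-thought responses.
            candidates = policy.generate_candidates(prompt, n=samples_per_prompt)

            # 2. A reward model (or answer check) scores each response.
            scored = [(c, reward_model.score(prompt, c)) for c in candidates]

            # 3. Keep only high-reward responses as pseudo-labels.
            kept = [c for c, r in scored if r >= reward_threshold]
            training_data.extend((prompt, c) for c in kept)

        # 4. Update the policy on its own filtered outputs; the choice of
        #    update rule (iterative SFT vs. RL-style objectives) is the
        #    "training method" factor examined in the paper.
        policy = policy.finetune(training_data)

        # 5. Optionally vary the prompt set between iterations.
        prompts = vary_prompts(prompts)

    return policy
```

The loop is deliberately schematic: the paper's contribution lies in how each of these steps is configured and balanced, not in the outer loop itself.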