Diving into Self-Evolving Training for Multimodal Reasoning
December 23, 2024
Authors: Wei Liu, Junlong Li, Xiwen Zhang, Fan Zhou, Yu Cheng, Junxian He
cs.AI
Abstract
Reasoning ability is essential for Large Multimodal Models (LMMs). In the
absence of multimodal chain-of-thought annotated data, self-evolving training,
where the model learns from its own outputs, has emerged as an effective and
scalable approach for enhancing reasoning abilities. Despite its growing usage,
a comprehensive understanding of self-evolving training, particularly in the
context of multimodal reasoning, remains limited. In this paper, we delve into
the intricacies of self-evolving training for multimodal reasoning, pinpointing
three key factors: Training Method, Reward Model, and Prompt Variation. We
systematically examine each factor and explore how various configurations
affect the training's effectiveness. Our analysis leads to a set of best
practices for each factor, aimed at optimizing multimodal reasoning.
Furthermore, we explore the Self-Evolution Dynamics during training and the
impact of automatic balancing mechanisms in boosting performance. After all the
investigations, we present a final recipe for self-evolving training in
multimodal reasoning, encapsulating these design choices into a framework we
call MSTaR (Multimodal Self-evolving Training for Reasoning), which is
universally effective for models with different sizes on various benchmarks,
e.g., surpassing the pre-evolved model significantly on 5 multimodal reasoning
benchmarks without using additional human annotations, as demonstrated on
MiniCPM-V-2.5 (8B), Phi-3.5-Vision (4B) and InternVL2 (2B). We believe this
study fills a significant gap in the understanding of self-evolving training
for multimodal reasoning and offers a robust framework for future research. Our
policy and reward models, as well as the collected data, are released to
facilitate further investigation in multimodal reasoning.
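For readers unfamiliar with the paradigm, the sketch below illustrates the generic control flow of one self-evolving training iteration: sample responses from the current policy, score them with a reward model, keep high-reward traces, and update the policy on its own filtered outputs. This is a minimal conceptual sketch, not the MSTaR implementation; all function names, the threshold, and the filtering rule are illustrative assumptions, and the paper studies richer choices of training method, reward model, and prompt variation.

```python
# Minimal conceptual sketch of a self-evolving training loop.
# All components here are stubs/assumptions, not the paper's actual code.
from dataclasses import dataclass
from typing import Callable, List
import random


@dataclass
class Sample:
    prompt: str      # multimodal prompt (image + question), simplified to text here
    response: str    # model-generated chain-of-thought response
    reward: float    # score assigned by the reward model


def self_evolve(
    generate: Callable[[str, int], List[str]],  # policy: prompt -> k sampled responses
    score: Callable[[str, str], float],         # reward model: (prompt, response) -> score
    train: Callable[[List[Sample]], None],      # one update step on the selected samples
    prompts: List[str],
    iterations: int = 3,
    samples_per_prompt: int = 4,
    keep_threshold: float = 0.5,
) -> None:
    """Iteratively sample, score, filter, and retrain on the model's own outputs."""
    for _ in range(iterations):
        selected: List[Sample] = []
        for prompt in prompts:
            for response in generate(prompt, samples_per_prompt):
                r = score(prompt, response)
                if r >= keep_threshold:  # simple reward filtering; a real recipe may differ
                    selected.append(Sample(prompt, response, r))
        train(selected)  # the policy improves on its own high-reward responses


# Toy usage with stub components, just to show the control flow.
if __name__ == "__main__":
    rng = random.Random(0)

    def toy_generate(prompt: str, k: int) -> List[str]:
        return [f"answer-{rng.randint(0, 9)}" for _ in range(k)]

    def toy_score(prompt: str, response: str) -> float:
        return rng.random()

    def toy_train(batch: List[Sample]) -> None:
        print(f"training on {len(batch)} selected samples")

    self_evolve(toy_generate, toy_score, toy_train, prompts=["q1", "q2"])
```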