
OThink-MR1: Stimulating multimodal generalized reasoning capabilities via dynamic reinforcement learning

March 20, 2025
Authors: Zhiyuan Liu, Yuting Zhang, Feng Liu, Changwang Zhang, Ying Sun, Jun Wang
cs.AI

Abstract

Multimodal Large Language Models (MLLMs) have gained significant traction for their ability to process diverse input data types and generate coherent, contextually relevant outputs across various applications. While supervised fine-tuning (SFT) has been the predominant approach to enhance MLLM capabilities in task-specific optimization, it often falls short in fostering crucial generalized reasoning abilities. Although reinforcement learning (RL) holds great promise in overcoming these limitations, it encounters two significant challenges: (1) its generalized capacities in multimodal tasks remain largely unexplored, and (2) its training constraints, including the constant Kullback-Leibler divergence or the clamp strategy, often result in suboptimal bottlenecks. To address these challenges, we propose OThink-MR1, an advanced MLLM equipped with profound comprehension and reasoning capabilities across multimodal tasks. Specifically, we introduce Group Relative Policy Optimization with a dynamic Kullback-Leibler strategy (GRPO-D), which markedly enhances reinforcement learning (RL) performance. For Qwen2-VL-2B-Instruct, GRPO-D achieves a relative improvement of more than 5.72% over SFT and more than 13.59% over GRPO in same-task evaluation on two adapted datasets. Furthermore, GRPO-D demonstrates remarkable cross-task generalization capabilities, with an average relative improvement of more than 61.63% over SFT in cross-task evaluation. These results highlight that the MLLM trained with GRPO-D on one multimodal task can be effectively transferred to another task, underscoring the superior generalized reasoning capabilities of our proposed OThink-MR1 model.
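To make the core idea concrete, below is a minimal PyTorch sketch of a GRPO-style objective in which the constant Kullback-Leibler coefficient criticized in the abstract is replaced by a scheduled, dynamic one. Everything in it is an illustrative assumption rather than the paper's implementation: the names dynamic_kl_weight and grpo_d_loss, the linear ramp schedule, and the defaults beta_min, beta_max, and clip_eps are all hypothetical.

import torch

def dynamic_kl_weight(step: int, total_steps: int,
                      beta_min: float = 0.0, beta_max: float = 0.04) -> float:
    # Linearly ramp the KL weight from beta_min to beta_max over training.
    # A small penalty early permits exploration away from the reference
    # policy; a larger penalty later stabilizes convergence. (Illustrative
    # schedule only; the paper's exact dynamic strategy may differ.)
    frac = min(step / max(total_steps, 1), 1.0)
    return beta_min + (beta_max - beta_min) * frac

def grpo_d_loss(logp: torch.Tensor, logp_old: torch.Tensor,
                logp_ref: torch.Tensor, rewards: torch.Tensor,
                step: int, total_steps: int,
                clip_eps: float = 0.2) -> torch.Tensor:
    # logp / logp_old / logp_ref: per-response log-probabilities under the
    # current, behavior, and frozen reference policies, each of shape (G,)
    # for a group of G sampled responses to the same prompt.

    # Group-relative advantage: standardize rewards within the group.
    adv = (rewards - rewards.mean()) / (rewards.std() + 1e-8)

    # PPO-style clipped surrogate on the importance ratio.
    ratio = torch.exp(logp - logp_old)
    clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps)
    surrogate = torch.minimum(ratio * adv, clipped * adv)

    # Unbiased per-sample KL estimate to the reference policy
    # (the "k3" estimator commonly paired with GRPO).
    kl = torch.exp(logp_ref - logp) - (logp_ref - logp) - 1.0

    # Dynamic KL weight in place of the usual constant coefficient.
    beta = dynamic_kl_weight(step, total_steps)
    return -(surrogate - beta * kl).mean()

In a training loop, grpo_d_loss would be called once per prompt group with the current optimizer step, so the KL penalty starts loose (encouraging exploration away from the reference policy) and tightens as training proceeds; whether the paper ramps the weight up, down, or non-linearly is a schedule detail this sketch does not claim to reproduce.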
