LMM-R1: Empowering 3B LMMs with Strong Reasoning Abilities Through Two-Stage Rule-Based RL

Abstract

Enhancing reasoning in Large Multimodal Models (LMMs) is uniquely challenging because visual perception and logical reasoning interact in complex ways, particularly in compact 3B-parameter architectures, where architectural constraints limit both reasoning capacity and modality alignment. While rule-based reinforcement learning (RL) excels in text-only domains, its multimodal extension confronts two critical barriers: (1) data limitations due to ambiguous answers and scarce complex reasoning examples, and (2) degraded foundational reasoning induced by multimodal pretraining. To address these challenges, we propose LMM-R1, a two-stage framework that adapts rule-based RL to multimodal reasoning through Foundational Reasoning Enhancement (FRE) followed by Multimodal Generalization Training (MGT). The FRE stage first strengthens reasoning abilities using text-only data with rule-based RL; the MGT stage then generalizes these reasoning capabilities to multimodal domains. Experiments on Qwen2.5-VL-Instruct-3B demonstrate that LMM-R1 achieves 4.83% and 4.5% average improvements over baselines on multimodal and text-only benchmarks, respectively, with a 3.63% gain on complex Football Game tasks. These results validate that text-based reasoning enhancement enables effective multimodal generalization, offering a data-efficient paradigm that bypasses costly high-quality multimodal training data.
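As a rough illustration of the two ideas named in the abstract, the sketch below shows what a rule-based reward and a two-stage data schedule could look like. The abstract does not specify the reward rules, the output format, or the training loop, so everything here is an assumption made for illustration: the `<answer>` tag convention, the `format_weight` parameter, the `Example` type, and the function names are all hypothetical and are not taken from the paper.

```python
import re
from dataclasses import dataclass
from typing import Iterable, Iterator, Optional, Tuple

# Assumed output convention: the final answer is wrapped in <answer>...</answer>.
ANSWER_RE = re.compile(r"<answer>(.*?)</answer>", re.DOTALL)


def extract_answer(response: str) -> Optional[str]:
    """Return the text inside the first <answer> tag, or None if it is absent."""
    match = ANSWER_RE.search(response)
    return match.group(1).strip() if match else None


def rule_based_reward(response: str, reference: str, format_weight: float = 0.5) -> float:
    """Score a response with fixed rules instead of a learned reward model.

    Hypothetical rule set: part of the reward for following the answer format,
    the rest for an exact string match with the reference answer.
    """
    answer = extract_answer(response)
    format_ok = answer is not None
    correct = format_ok and answer == reference.strip()
    return format_weight * float(format_ok) + (1.0 - format_weight) * float(correct)


@dataclass
class Example:
    prompt: str
    reference: str
    image_path: Optional[str] = None  # None marks a text-only example


def two_stage_schedule(
    text_only: Iterable[Example], multimodal: Iterable[Example]
) -> Iterator[Tuple[str, Example]]:
    """Yield (stage, example) pairs: all FRE (text-only) data first, then MGT data."""
    for example in text_only:
        yield "FRE", example
    for example in multimodal:
        yield "MGT", example
```

Because the reward depends only on checkable rules, it applies to questions with an unambiguous reference answer, which is consistent with the abstract's remark that ambiguous answers limit the multimodal data available for rule-based RL.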
