

R1-Zero's "Aha Moment" in Visual Reasoning on a 2B Non-SFT Model

March 7, 2025
Authors: Hengguang Zhou, Xirui Li, Ruochen Wang, Minhao Cheng, Tianyi Zhou, Cho-Jui Hsieh
cs.AI

Abstract

Recently, DeepSeek R1 demonstrated how reinforcement learning with simple rule-based incentives can enable the autonomous development of complex reasoning in large language models, characterized by the "aha moment", in which the model manifests self-reflection and increased response length during training. However, attempts to extend this success to multimodal reasoning have often failed to reproduce these key characteristics. In this report, we present the first successful replication of these emergent characteristics for multimodal reasoning on a non-SFT 2B model. Starting from Qwen2-VL-2B and applying reinforcement learning directly on the SAT dataset, our model achieves 59.47% accuracy on CVBench, outperforming the base model by approximately 30% and exceeding both SFT settings by about 2%. In addition, we share our failed attempts and insights from trying to achieve R1-like reasoning using RL with instruct models, aiming to shed light on the challenges involved. Our key observations include: (1) applying RL to instruct models often results in trivial reasoning trajectories, and (2) naive length rewards are ineffective in eliciting reasoning capabilities. The project code is available at https://github.com/turningpoint-ai/VisualThinker-R1-Zero.
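
As a rough illustration of the "simple rule-based incentives" mentioned above, a minimal reward sketch in the R1-Zero style might combine a format check with an answer-correctness check. This is an assumption-laden example, not the authors' exact implementation (see the linked repository for that): the `<think>`/`<answer>` tag scheme, the 0.5/1.0 weights, and the exact-match comparison are all illustrative choices.

```python
import re

# Hypothetical rule-based reward in the R1-Zero style.
# Tag format, weights, and matching rule are illustrative assumptions,
# not the exact scheme used by VisualThinker-R1-Zero.
THINK_ANSWER_PATTERN = re.compile(
    r"<think>.*?</think>\s*<answer>(.*?)</answer>", re.DOTALL
)

def rule_based_reward(completion: str, ground_truth: str) -> float:
    """Score one model completion with simple, verifiable rules.

    - format reward: response must contain <think>...</think>
      followed by <answer>...</answer>
    - accuracy reward: the extracted answer must match the ground truth
    """
    reward = 0.0
    match = THINK_ANSWER_PATTERN.search(completion)
    if match:
        reward += 0.5  # format followed
        predicted = match.group(1).strip().lower()
        if predicted == ground_truth.strip().lower():
            reward += 1.0  # answer correct
    return reward


if __name__ == "__main__":
    sample = "<think>The cube is to the left of the sphere.</think><answer>left</answer>"
    print(rule_based_reward(sample, "left"))  # 1.5
```

Such a reward is computed per sampled response and fed to the RL optimizer; note that the abstract's second observation suggests that adding a naive response-length bonus to a reward like this is not, by itself, enough to elicit genuine reasoning.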
