R1-Onevision: Advancing Generalized Multimodal Reasoning through Cross-Modal Formalization

Abstract

Large language models have demonstrated remarkable reasoning capabilities on complex textual tasks. However, multimodal reasoning, which requires integrating visual and textual information, remains a significant challenge. Existing vision-language models often struggle to analyze and reason about visual content effectively, resulting in suboptimal performance on complex reasoning tasks. Moreover, the absence of comprehensive benchmarks hinders accurate assessment of multimodal reasoning capabilities. In this paper, we introduce R1-Onevision, a multimodal reasoning model designed to bridge the gap between visual perception and deep reasoning. To achieve this, we propose a cross-modal reasoning pipeline that transforms images into formal textual representations, enabling precise language-based reasoning. Leveraging this pipeline, we construct the R1-Onevision dataset, which provides detailed, step-by-step multimodal reasoning annotations across diverse domains. We further develop the R1-Onevision model through supervised fine-tuning and reinforcement learning to cultivate advanced reasoning and robust generalization abilities. To comprehensively evaluate multimodal reasoning performance across different grade levels, we introduce R1-Onevision-Bench, a benchmark aligned with human educational stages that covers exams from junior high school to university and beyond. Experimental results show that R1-Onevision achieves state-of-the-art performance on multiple challenging multimodal reasoning benchmarks, outperforming models such as GPT-4o and Qwen2.5-VL.

