VisualPRM: An Effective Process Reward Model for Multimodal Reasoning

Abstract

We introduce VisualPRM, an advanced multimodal Process Reward Model (PRM) with 8B parameters, which improves the reasoning abilities of existing Multimodal Large Language Models (MLLMs) across different model scales and families under Best-of-N (BoN) evaluation. Specifically, our model improves the reasoning performance of three types of MLLMs at four different model scales. Even when applied to the highly capable InternVL2.5-78B, it achieves a 5.9-point improvement across seven multimodal reasoning benchmarks. Experimental results show that our model outperforms Outcome Reward Models and Self-Consistency in BoN evaluation. To facilitate the training of multimodal PRMs, we construct VisualPRM400K, a multimodal process supervision dataset of 400K samples, using an automated data pipeline. For the evaluation of multimodal PRMs, we propose VisualProcessBench, a benchmark with human-annotated step-wise correctness labels that measures the ability of PRMs to detect erroneous steps in multimodal reasoning tasks. We hope that our work can inspire further research and contribute to the development of MLLMs. Our model, data, and benchmark are released at https://internvl.github.io/blog/2025-03-13-VisualPRM/.
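
To make the comparison in the abstract concrete, the following is a minimal sketch of Best-of-N selection with a process reward model, next to Self-Consistency (majority voting). It is not the released implementation: `score_steps` is a hypothetical stand-in for a PRM that returns one correctness score per reasoning step, `toy_scorer` is an invented placeholder, and averaging the step scores into a response score is an assumption made here for illustration.

```python
from collections import Counter
from typing import Callable, List, Sequence

# A candidate response is modelled as a list of reasoning steps,
# with the final answer as the last element (an assumption of this sketch).
Response = List[str]


def best_of_n(candidates: Sequence[Response],
              score_steps: Callable[[Response], List[float]]) -> Response:
    """Return the candidate with the highest mean step score.

    `score_steps` is a hypothetical PRM interface: one score per step.
    Mean aggregation is an assumed choice, not taken from the abstract.
    """
    def aggregate(response: Response) -> float:
        scores = score_steps(response)
        return sum(scores) / len(scores) if scores else float("-inf")

    return max(candidates, key=aggregate)


def self_consistency(final_answers: Sequence[str]) -> str:
    """Baseline: return the most frequent final answer among the candidates."""
    return Counter(final_answers).most_common(1)[0][0]


def toy_scorer(response: Response) -> List[float]:
    """Placeholder scorer: penalises steps explicitly marked as errors."""
    return [0.1 if "error" in step else 0.9 for step in response]


candidates = [
    ["read the chart", "error: misread axis", "answer: 12"],
    ["read the chart", "sum the bars", "answer: 15"],
    ["guess", "error: skipped a bar", "answer: 12"],
]

chosen = best_of_n(candidates, toy_scorer)
majority = self_consistency([c[-1] for c in candidates])
```

In this toy setting the two strategies disagree: majority voting returns the answer shared by the two flawed responses, while step-level scoring selects the single response without a flagged step. This is the kind of case where process supervision can help, although the toy scorer says nothing about how a real PRM behaves.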
