

Efficient Process Reward Model Training via Active Learning

April 14, 2025
Authors: Keyu Duan, Zichen Liu, Xin Mao, Tianyu Pang, Changyu Chen, Qiguang Chen, Michael Qizhe Shieh, Longxu Dou
cs.AI

Abstract

Process Reward Models (PRMs) provide step-level supervision to large language models (LLMs), but scaling up training data annotation remains challenging for both humans and LLMs. To address this limitation, we propose an active learning approach, ActPRM, which proactively selects the most uncertain samples for training, substantially reducing labeling costs. During training, we use the PRM to estimate uncertainty after the forward pass, retaining only highly uncertain data. A capable yet costly reasoning model then labels this data, after which we compute the loss with respect to the labels and update the PRM's weights. We compare ActPRM with vanilla fine-tuning in a pool-based active learning setting, demonstrating that ActPRM reduces annotation by 50% while achieving comparable or even better performance. Beyond annotation efficiency, we further advance the actively trained PRM by filtering over 1M math reasoning trajectories with ActPRM, retaining 60% of the data. Subsequent training on this selected dataset yields a new state-of-the-art (SOTA) PRM on ProcessBench (75.0%) and PRMBench (65.5%) compared with models of the same size.
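
The training loop described in the abstract (forward pass, uncertainty estimate, selective annotation, weight update) can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the `prm`, `labeler`, and `threshold` names are hypothetical, the PRM is assumed to output per-step correct/incorrect logits, and mean step entropy stands in for whatever uncertainty measure the paper actually uses.

```python
# Hypothetical sketch of uncertainty-driven PRM training (not the paper's code).
import torch
import torch.nn.functional as F


def step_uncertainty(step_logits: torch.Tensor) -> float:
    """Mean entropy of per-step correct/incorrect predictions, as an uncertainty proxy."""
    probs = F.softmax(step_logits, dim=-1)                      # (num_steps, 2)
    entropy = -(probs * probs.clamp_min(1e-12).log()).sum(-1)   # (num_steps,)
    return entropy.mean().item()


def active_prm_step(prm, labeler, optimizer, batch, threshold: float = 0.5):
    """One iteration: forward pass, keep uncertain trajectories,
    label them with a strong reasoning model, then update the PRM."""
    kept = []
    with torch.no_grad():
        for traj in batch:
            logits = prm(traj)                                   # (num_steps, 2) step scores
            if step_uncertainty(logits) > threshold:             # retain only highly uncertain data
                kept.append(traj)
    if not kept:
        return None                                              # nothing worth annotating this batch

    labels = [labeler(traj) for traj in kept]                    # costly reasoning-model annotation

    optimizer.zero_grad()
    loss = torch.stack([
        F.cross_entropy(prm(traj), torch.as_tensor(y))           # step-level supervision
        for traj, y in zip(kept, labels)
    ]).mean()
    loss.backward()
    optimizer.step()
    return loss.item()
```

In this reading, the same uncertainty score that gates annotation during training is what later filters the 1M+ trajectory pool down to the 60% retained for the final training run.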
