Efficient Process Reward Model Training via Active Learning
April 14, 2025
Authors: Keyu Duan, Zichen Liu, Xin Mao, Tianyu Pang, Changyu Chen, Qiguang Chen, Michael Qizhe Shieh, Longxu Dou
cs.AI
Abstract
Process Reward Models (PRMs) provide step-level supervision to large language
models (LLMs), but scaling up training data annotation remains challenging for
both humans and LLMs. To address this limitation, we propose an active learning
approach, ActPRM, which proactively selects the most uncertain samples for
training, substantially reducing labeling costs. During training, we use the
PRM to estimate uncertainty after the forward pass, retaining only highly
uncertain data. A capable yet costly reasoning model then labels this data.
We then compute the loss with respect to these labels and update the PRM's
weights. We compare ActPRM with vanilla fine-tuning in a pool-based active
learning setting, demonstrating that ActPRM reduces annotation by 50% while
achieving comparable or even better performance. Beyond annotation
efficiency, we further advance the actively trained PRM by filtering over 1M
math reasoning trajectories with ActPRM, retaining 60% of the data. Subsequent
training on this selected dataset yields a new state-of-the-art (SOTA) PRM on
ProcessBench (75.0%) and PRMBench (65.5%) compared with same-sized models.
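To make the described loop concrete, below is a minimal PyTorch-style sketch of the score-filter-annotate-update cycle outlined in the abstract. It is not the authors' implementation: the model interface (`prm` returning per-step logits), the annotator callback `annotate_with_reasoning_model`, and the threshold `UNCERTAINTY_THRESHOLD` are all assumed placeholders, and the uncertainty measure shown (mean binary entropy over steps) is just one plausible choice.

```python
# Hypothetical sketch of uncertainty-driven active training for a step-level PRM.
# All names below are placeholders, not the paper's actual code or API.

import torch
import torch.nn.functional as F

UNCERTAINTY_THRESHOLD = 0.6  # assumed hyperparameter: keep only highly uncertain samples


def step_uncertainty(step_logits: torch.Tensor) -> torch.Tensor:
    """Binary predictive entropy per reasoning step, averaged over the trajectory.

    step_logits: shape (num_steps,), logits for "this step is correct".
    """
    p = torch.sigmoid(step_logits)
    entropy = -(p * torch.log(p + 1e-9) + (1 - p) * torch.log(1 - p + 1e-9))
    return entropy.mean()


def active_training_step(prm, optimizer, batch, annotate_with_reasoning_model):
    """One pass of the active-learning loop: score -> filter -> annotate -> update."""
    # 1) Forward pass without gradients to estimate uncertainty for each trajectory.
    with torch.no_grad():
        uncertainties = [step_uncertainty(prm(traj)) for traj in batch]

    # 2) Retain only highly uncertain trajectories; the rest stay unlabeled.
    selected = [traj for traj, u in zip(batch, uncertainties) if u > UNCERTAINTY_THRESHOLD]
    if not selected:
        return None

    # 3) Query the capable but costly reasoning model for step-level labels (0/1 per step).
    step_labels = [annotate_with_reasoning_model(traj) for traj in selected]

    # 4) Supervised update of the PRM on the newly labeled subset.
    optimizer.zero_grad()
    loss = torch.stack([
        F.binary_cross_entropy_with_logits(prm(traj), labels.float())
        for traj, labels in zip(selected, step_labels)
    ]).mean()
    loss.backward()
    optimizer.step()
    return loss.item()
```

The same uncertainty score could, under these assumptions, also serve as the offline filter mentioned in the abstract: ranking a large pool of trajectories and keeping only the most uncertain fraction before a further round of training.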