Efficient Process Reward Model Training via Active Learning
April 14, 2025
Authors: Keyu Duan, Zichen Liu, Xin Mao, Tianyu Pang, Changyu Chen, Qiguang Chen, Michael Qizhe Shieh, Longxu Dou
cs.AI
Abstract
Process Reward Models (PRMs) provide step-level supervision to large language
models (LLMs), but scaling up training data annotation remains challenging for
both humans and LLMs. To address this limitation, we propose an active learning
approach, ActPRM, which proactively selects the most uncertain samples for
training, substantially reducing labeling costs. During training, we use the
PRM to estimate uncertainty after the forward pass, retaining only highly
uncertain data. A capable yet costly reasoning model then labels this data.
We then compute the loss with respect to these labels and update the PRM's
weights. We compare ActPRM with vanilla fine-tuning in a pool-based active
learning setting, demonstrating that ActPRM reduces annotation by 50% while
achieving comparable or even better performance. Beyond annotation
efficiency, we further advance the actively trained PRM by filtering over 1M
math reasoning trajectories with ActPRM, retaining 60% of the data. Subsequent
training on this selected dataset yields a new state-of-the-art (SOTA) PRM on
ProcessBench (75.0%) and PRMBench (65.5%) compared with same-sized models.
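To make the described loop concrete, below is a minimal PyTorch-style sketch of the score-filter-annotate-update cycle outlined in the abstract. It is not the authors' implementation: the model interface (`prm` returning per-step logits), the annotator callback `annotate_with_reasoning_model`, and the threshold `UNCERTAINTY_THRESHOLD` are all assumed placeholders, and the uncertainty measure shown (mean binary entropy over steps) is just one plausible choice.

```python
# Hypothetical sketch of uncertainty-driven active training for a step-level PRM.
# All names below are placeholders, not the paper's actual code or API.

import torch
import torch.nn.functional as F

UNCERTAINTY_THRESHOLD = 0.6  # assumed hyperparameter: keep only highly uncertain samples


def step_uncertainty(step_logits: torch.Tensor) -> torch.Tensor:
    """Binary predictive entropy per reasoning step, averaged over the trajectory.

    step_logits: shape (num_steps,), logits for "this step is correct".
    """
    p = torch.sigmoid(step_logits)
    entropy = -(p * torch.log(p + 1e-9) + (1 - p) * torch.log(1 - p + 1e-9))
    return entropy.mean()


def active_training_step(prm, optimizer, batch, annotate_with_reasoning_model):
    """One pass of the active-learning loop: score -> filter -> annotate -> update."""
    # 1) Forward pass without gradients to estimate uncertainty for each trajectory.
    with torch.no_grad():
        uncertainties = [step_uncertainty(prm(traj)) for traj in batch]

    # 2) Retain only highly uncertain trajectories; the rest stay unlabeled.
    selected = [traj for traj, u in zip(batch, uncertainties) if u > UNCERTAINTY_THRESHOLD]
    if not selected:
        return None

    # 3) Query the capable but costly reasoning model for step-level labels (0/1 per step).
    step_labels = [annotate_with_reasoning_model(traj) for traj in selected]

    # 4) Supervised update of the PRM on the newly labeled subset.
    optimizer.zero_grad()
    loss = torch.stack([
        F.binary_cross_entropy_with_logits(prm(traj), labels.float())
        for traj, labels in zip(selected, step_labels)
    ]).mean()
    loss.backward()
    optimizer.step()
    return loss.item()
```

The same uncertainty score could, under these assumptions, also serve as the offline filter mentioned in the abstract: ranking a large pool of trajectories and keeping only the most uncertain fraction before a further round of training.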