
SPC: Evolving Self-Play Critic via Adversarial Games for LLM Reasoning

April 27, 2025
作者: Jiaqi Chen, Bang Zhang, Ruotian Ma, Peisong Wang, Xiaodan Liang, Zhaopeng Tu, Xiaolong Li, Kwan-Yee K. Wong
cs.AI

Abstract

Evaluating the step-by-step reliability of large language model (LLM) reasoning, such as Chain-of-Thought, remains challenging due to the difficulty and cost of obtaining high-quality step-level supervision. In this paper, we introduce Self-Play Critic (SPC), a novel approach where a critic model evolves its ability to assess reasoning steps through adversarial self-play games, eliminating the need for manual step-level annotation. SPC involves fine-tuning two copies of a base model to play two roles, namely a "sneaky generator" that deliberately produces erroneous steps designed to be difficult to detect, and a "critic" that analyzes the correctness of reasoning steps. These two models engage in an adversarial game in which the generator aims to fool the critic, while the critic seeks to identify the generator's errors. Using reinforcement learning based on the game outcomes, the models iteratively improve; the winner of each confrontation receives a positive reward and the loser receives a negative reward, driving continuous self-evolution. Experiments on three reasoning process benchmarks (ProcessBench, PRM800K, DeltaBench) demonstrate that our SPC progressively enhances its error detection capabilities (e.g., accuracy increases from 70.8% to 77.7% on ProcessBench) and surpasses strong baselines, including a distilled R1 model. Furthermore, applying SPC to guide the test-time search of diverse LLMs significantly improves their mathematical reasoning performance on MATH500 and AIME2024, outperforming state-of-the-art process reward models.
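
For intuition, the following is a minimal Python sketch of the adversarial self-play loop and the win/lose reward assignment described in the abstract. It is not the paper's implementation: the object interface (generate_sneaky_step, judge_step, rl_update) and the zero-sum ±1 reward values are illustrative assumptions.

```python
# Hypothetical sketch of SPC's adversarial self-play game (not the authors' actual API).

def self_play_round(sneaky_generator, critic, problem, correct_prefix):
    """Play one adversarial round and return the reward for each player."""
    # The sneaky generator injects a deliberately erroneous, hard-to-detect step.
    sneaky_step = sneaky_generator.generate_sneaky_step(problem, correct_prefix)

    # The critic analyzes the step and predicts whether it is correct or erroneous.
    verdict = critic.judge_step(problem, correct_prefix, sneaky_step)  # "correct" | "incorrect"

    # Zero-sum outcome: the critic wins if it flags the injected error, otherwise the generator wins.
    if verdict == "incorrect":
        return {"generator": -1.0, "critic": +1.0}
    return {"generator": +1.0, "critic": -1.0}


def train(sneaky_generator, critic, problems, num_rounds):
    """Iteratively improve both players with RL on the game outcomes."""
    for _ in range(num_rounds):
        for problem, correct_prefix in problems:
            rewards = self_play_round(sneaky_generator, critic, problem, correct_prefix)
            # Winner gets a positive reward, loser a negative one, driving continuous self-evolution.
            sneaky_generator.rl_update(reward=rewards["generator"])
            critic.rl_update(reward=rewards["critic"])
```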
