

When To Solve, When To Verify: Compute-Optimal Problem Solving and Generative Verification for LLM Reasoning

April 1, 2025
作者: Nishad Singhi, Hritik Bansal, Arian Hosseini, Aditya Grover, Kai-Wei Chang, Marcus Rohrbach, Anna Rohrbach
cs.AI

Abstract
Scaling test-time compute has emerged as a key strategy for enhancing the reasoning capabilities of large language models (LLMs), particularly in tasks like mathematical problem-solving. A traditional approach, Self-Consistency (SC), generates multiple solutions to a problem and selects the most common answer via majority voting. Another common method involves scoring each solution with a reward model (verifier) and choosing the best one. Recent advancements in Generative Reward Models (GenRM) reframe verification as a next-token prediction task, enabling inference-time scaling along a new axis. Specifically, GenRM generates multiple verification chains-of-thought to score each solution. Under a limited inference budget, this introduces a fundamental trade-off: should you spend the budget on scaling solutions via SC or generate fewer solutions and allocate compute to verification via GenRM? To address this, we evaluate GenRM against SC under a fixed inference budget. Interestingly, we find that SC is more compute-efficient than GenRM for most practical inference budgets across diverse models and datasets. For instance, GenRM first matches SC after consuming up to 8x the inference compute and requires significantly more compute to outperform it. Furthermore, we derive inference scaling laws for the GenRM paradigm, revealing that compute-optimal inference favors scaling solution generation more aggressively than scaling the number of verifications. Our work provides practical guidance on optimizing test-time scaling by balancing solution generation and verification. The code is available at https://github.com/nishadsinghi/sc-genrm-scaling.
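The trade-off described above can be sketched numerically. The following is a minimal illustration, not code from the paper: the function names (`self_consistency`, `total_compute`) are hypothetical, and it assumes, for simplicity, that one verification chain costs roughly as much as one solution chain. Under a fixed budget C, SC spends all of C on solutions, while GenRM with V verifications per solution affords only about C / (1 + V) solutions.

```python
from collections import Counter

def self_consistency(answers):
    """Pick the most common final answer among sampled solutions (majority voting)."""
    return Counter(answers).most_common(1)[0][0]

def total_compute(num_solutions, num_verifications=0,
                  cost_solution=1.0, cost_verification=1.0):
    """Total inference cost: S solution chains, plus S * V verification
    chains when a generative verifier (GenRM-style) is used.

    SC corresponds to num_verifications = 0: the whole budget goes to
    sampling solutions.
    """
    return (num_solutions * cost_solution
            + num_solutions * num_verifications * cost_verification)

# Example: with a budget of 32 units and equal per-chain costs,
# SC can sample 32 solutions, while GenRM with 3 verifications per
# solution affords only 8 solutions (8 * (1 + 3) = 32).
answers = ["42", "41", "42", "42", "40"]
print(self_consistency(answers))              # -> 42
print(total_compute(32))                      # -> 32.0
print(total_compute(8, num_verifications=3))  # -> 32.0
```

This accounting is what makes the comparison non-trivial: at the same budget, GenRM scores fewer candidate solutions, so its verifier must be accurate enough to beat a much larger majority-vote pool.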
