Sample, Scrutinize and Scale: Effective Inference-Time Search by Scaling Verification
February 3, 2025
Authors: Eric Zhao, Pranjal Awasthi, Sreenivas Gollapudi
cs.AI
Abstract
Sampling-based search, a simple paradigm for utilizing test-time compute,
involves generating multiple candidate responses and selecting the best one --
typically by verifying each response for correctness. In this paper, we study
the scaling trends governing sampling-based search. Among our findings is that
simply scaling up a minimalist implementation that uses only random sampling
and direct self-verification results in sustained performance improvements
that, for example, elevate the Gemini v1.5 Pro model's reasoning capabilities
past those of o1-Preview on popular benchmarks. We partially attribute the
scalability of sampling-based search to a phenomenon of implicit scaling, where
sampling a larger pool of responses in turn improves verification accuracy. We
further identify two useful principles for improving self-verification
capabilities with test-time compute: (1) comparing across responses provides
helpful signals about the locations of errors and hallucinations, and (2)
different model output styles are useful for different contexts -- chains of
thought are useful for reasoning but harder to verify. We also find that,
though accurate verification can be elicited, frontier models demonstrate
remarkably weak out-of-the-box verification capabilities, and we introduce a
benchmark to measure progress on these deficiencies.
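
To make the minimalist setup the abstract describes concrete, the sketch below is a hypothetical Python outline of sampling-based search with random sampling and direct self-verification. It is not the paper's implementation: the `generate` callable, the prompt wording, and the default parameter values are all assumptions introduced here for illustration.

```python
from typing import Callable, List

# Assumed interface: generate(prompt, temperature) returns one model
# completion as a string. Any LLM client could be adapted to this shape.
GenerateFn = Callable[[str, float], str]


def sampling_based_search(
    question: str,
    generate: GenerateFn,
    num_candidates: int = 16,      # size of the sampled response pool
    num_verifications: int = 4,    # self-verification queries per candidate
    temperature: float = 0.9,
) -> str:
    """Minimal sampling-based search: sample candidates, self-verify, pick the best."""
    # 1. Sample a pool of candidate responses via random (temperature) sampling.
    candidates: List[str] = [
        generate(f"Question: {question}\nAnswer with full reasoning.", temperature)
        for _ in range(num_candidates)
    ]

    # 2. Score each candidate by direct self-verification: ask the model whether
    #    the response is correct several times and average the yes-votes.
    def verification_score(response: str) -> float:
        votes = 0
        for _ in range(num_verifications):
            verdict = generate(
                "You are checking a candidate answer for correctness.\n"
                f"Question: {question}\n"
                f"Candidate answer:\n{response}\n"
                "Is this answer correct? Reply with 'yes' or 'no' only.",
                temperature,
            )
            votes += 1 if verdict.strip().lower().startswith("yes") else 0
        return votes / num_verifications

    # 3. Select the candidate with the highest verification score.
    return max(candidates, key=verification_score)
```

In this sketch, increasing `num_candidates` and `num_verifications` is the knob for spending more test-time compute; the implicit-scaling observation in the abstract suggests that enlarging the sampled pool can also make the verification step more accurate.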