Sample, Scrutinize and Scale: Effective Inference-Time Search by Scaling Verification
February 3, 2025
Authors: Eric Zhao, Pranjal Awasthi, Sreenivas Gollapudi
cs.AI
Abstract
Sampling-based search, a simple paradigm for utilizing test-time compute,
involves generating multiple candidate responses and selecting the best one --
typically by verifying each response for correctness. In this paper, we study
the scaling trends governing sampling-based search. Among our findings is that
simply scaling up a minimalist implementation that uses only random sampling
and direct self-verification results in sustained performance improvements
that, for example, elevate the Gemini v1.5 Pro model's reasoning capabilities
past those of o1-Preview on popular benchmarks. We partially attribute the
scalability of sampling-based search to a phenomenon of implicit scaling, where
sampling a larger pool of responses in turn improves verification accuracy. We
further identify two useful principles for improving self-verification
capabilities with test-time compute: (1) comparing across responses provides
helpful signals about the locations of errors and hallucinations, and (2)
different model output styles are useful for different contexts -- chains of
thought are useful for reasoning but harder to verify. We also find that,
though accurate verification can be elicited, frontier models demonstrate
remarkably weak out-of-the-box verification capabilities, and we introduce a
benchmark to measure progress on these deficiencies.
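
To make the minimalist setup the abstract describes concrete, the sketch below is a hypothetical Python outline of sampling-based search with random sampling and direct self-verification. It is not the paper's implementation: the `generate` callable, the prompt wording, and the default parameter values are all assumptions introduced here for illustration.

```python
from typing import Callable, List

# Assumed interface: generate(prompt, temperature) returns one model
# completion as a string. Any LLM client could be adapted to this shape.
GenerateFn = Callable[[str, float], str]


def sampling_based_search(
    question: str,
    generate: GenerateFn,
    num_candidates: int = 16,      # size of the sampled response pool
    num_verifications: int = 4,    # self-verification queries per candidate
    temperature: float = 0.9,
) -> str:
    """Minimal sampling-based search: sample candidates, self-verify, pick the best."""
    # 1. Sample a pool of candidate responses via random (temperature) sampling.
    candidates: List[str] = [
        generate(f"Question: {question}\nAnswer with full reasoning.", temperature)
        for _ in range(num_candidates)
    ]

    # 2. Score each candidate by direct self-verification: ask the model whether
    #    the response is correct several times and average the yes-votes.
    def verification_score(response: str) -> float:
        votes = 0
        for _ in range(num_verifications):
            verdict = generate(
                "You are checking a candidate answer for correctness.\n"
                f"Question: {question}\n"
                f"Candidate answer:\n{response}\n"
                "Is this answer correct? Reply with 'yes' or 'no' only.",
                temperature,
            )
            votes += 1 if verdict.strip().lower().startswith("yes") else 0
        return votes / num_verifications

    # 3. Select the candidate with the highest verification score.
    return max(candidates, key=verification_score)
```

In this sketch, increasing `num_candidates` and `num_verifications` is the knob for spending more test-time compute; the implicit-scaling observation in the abstract suggests that enlarging the sampled pool can also make the verification step more accurate.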