A Simple and Provable Scaling Law for the Test-Time Compute of Large Language Models
November 29, 2024
Authors: Yanxi Chen, Xuchen Pan, Yaliang Li, Bolin Ding, Jingren Zhou
cs.AI
Abstract
We propose a general two-stage algorithm that enjoys a provable scaling law
for the test-time compute of large language models (LLMs). Given an input
problem, the proposed algorithm first generates N candidate solutions, and
then chooses the best one via a multiple-round knockout tournament where each
pair of candidates is compared K times and only the winners move on to
the next round. In a minimalistic implementation, both stages can be executed
with a black-box LLM alone and nothing else (e.g., no external verifier or
reward model), and a total of N × (K + 1) highly parallelizable LLM
calls are needed for solving an input problem. Assuming that a generated
candidate solution is correct with probability p_{gen} > 0 and a
comparison between a pair of correct and incorrect solutions identifies the
right winner with probability p_{comp} > 0.5 (i.e., better than a
random guess), we prove theoretically that the failure probability of the
proposed algorithm decays to zero exponentially with respect to N and K:
$P(\text{final output is incorrect}) \le (1 - p_{gen})^N + \lceil \log_2 N \rceil \, e^{-2K(p_{comp} - 0.5)^2}.$ Our empirical
results with the challenging MMLU-Pro benchmark validate the technical
assumptions, as well as the efficacy of the proposed algorithm and the gains
from scaling up its test-time compute.
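To make the two-stage procedure concrete, below is a minimal Python sketch of the generate-then-knockout algorithm described in the abstract. It is an illustration under stated assumptions, not the authors' implementation: the black-box `llm` callable, the prompt wording, and the majority-vote parsing are all hypothetical placeholders.

```python
import random
from typing import Callable, List

def knockout_best_of_n(llm: Callable[[str], str], problem: str,
                       n: int = 8, k: int = 3) -> str:
    """Sketch of the two-stage algorithm: generate n candidates, then select
    one via a knockout tournament with k comparisons per pair. Uses roughly
    n * (k + 1) LLM calls in total, all highly parallelizable."""
    # Stage 1: generate n candidate solutions (n LLM calls).
    candidates: List[str] = [llm(f"Solve this problem:\n{problem}") for _ in range(n)]

    # Stage 2: knockout tournament over ceil(log2 n) rounds.
    while len(candidates) > 1:
        random.shuffle(candidates)
        winners: List[str] = []
        if len(candidates) % 2 == 1:
            winners.append(candidates.pop())  # odd count: one candidate gets a bye
        for a, b in zip(candidates[0::2], candidates[1::2]):
            # Compare the pair k times; the majority winner advances.
            # (Prompt and answer parsing are assumptions, not the paper's exact setup.)
            votes_for_a = sum(
                llm(f"Problem:\n{problem}\n\nSolution A:\n{a}\n\nSolution B:\n{b}\n\n"
                    "Which solution is more likely correct? Answer 'A' or 'B'.")
                .strip().upper().startswith("A")
                for _ in range(k)
            )
            winners.append(a if 2 * votes_for_a > k else b)
        candidates = winners
    return candidates[0]
```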
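The failure-probability bound can likewise be evaluated numerically. The sketch below simply plugs illustrative values (hypothetical, not figures reported in the paper) into the stated inequality to show the exponential decay in N and K.

```python
import math

def failure_bound(p_gen: float, p_comp: float, n: int, k: int) -> float:
    """Upper bound on P(final output is incorrect) from the abstract:
    (1 - p_gen)^n + ceil(log2 n) * exp(-2 k (p_comp - 0.5)^2)."""
    return (1 - p_gen) ** n + math.ceil(math.log2(n)) * math.exp(-2 * k * (p_comp - 0.5) ** 2)

# Illustrative values only: p_gen = 0.3, p_comp = 0.7.
print(failure_bound(0.3, 0.7, n=16, k=100))  # ~4.7e-3; shrinks further as n and k grow
```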