A Simple and Provable Scaling Law for the Test-Time Compute of Large Language Models
November 29, 2024
Authors: Yanxi Chen, Xuchen Pan, Yaliang Li, Bolin Ding, Jingren Zhou
cs.AI
Abstract
We propose a general two-stage algorithm that enjoys a provable scaling law
for the test-time compute of large language models (LLMs). Given an input
problem, the proposed algorithm first generates N candidate solutions, and
then chooses the best one via a multiple-round knockout tournament where each
pair of candidates is compared K times and only the winners move on to
the next round. In a minimalistic implementation, both stages can be executed
with a black-box LLM alone and nothing else (e.g., no external verifier or
reward model), and a total of $N \times (K + 1)$ highly parallelizable LLM
calls are needed for solving an input problem. Assuming that a generated
candidate solution is correct with probability $p_{\mathrm{gen}} > 0$ and a
comparison between a pair of correct and incorrect solutions identifies the
right winner with probability $p_{\mathrm{comp}} > 0.5$ (i.e., better than a
random guess), we prove theoretically that the failure probability of the
proposed algorithm decays to zero exponentially with respect to N and K:
$P(\text{final output is incorrect}) \le (1 - p_{\mathrm{gen}})^N + \lceil \log_2 N \rceil \, e^{-2K(p_{\mathrm{comp}} - 0.5)^2}.$ Our empirical
results with the challenging MMLU-Pro benchmark validate the technical
assumptions, as well as the efficacy of the proposed algorithm and the gains
from scaling up its test-time compute.
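As a concrete illustration, here is a minimal Python sketch of the two-stage procedure and the bound above. It is not the paper's code: `generate_solution` and `compare_pair` are hypothetical stand-ins for black-box LLM calls, and details such as tie-breaking and bye-handling for odd-sized rounds are assumptions.

```python
# Illustrative sketch of the two-stage generate-then-knockout algorithm.
# `generate_solution(problem)` returns one candidate solution;
# `compare_pair(problem, a, b)` returns 1 if candidate `a` is judged
# better, else 0. Both are hypothetical black-box LLM wrappers.

import math
import random


def knockout_best_of_n(problem, generate_solution, compare_pair, N=8, K=3):
    """Stage 1: draw N candidate solutions. Stage 2: knockout tournament.

    Each match compares a pair of candidates K times and advances the
    majority winner, for roughly N * (K + 1) LLM calls in total
    (N generations plus at most (N - 1) * K comparisons).
    """
    # Stage 1: generate N candidates independently (fully parallelizable).
    candidates = [generate_solution(problem) for _ in range(N)]

    # Stage 2: rounds of pairwise knockouts until one candidate remains.
    while len(candidates) > 1:
        bye = None
        if len(candidates) % 2 == 1:
            # With an odd number of candidates, one gets a bye this round
            # (an implementation detail assumed here, not given in the abstract).
            bye = candidates.pop(random.randrange(len(candidates)))
        winners = []
        for a, b in zip(candidates[0::2], candidates[1::2]):
            # Compare the pair K times; the majority winner advances.
            wins_for_a = sum(compare_pair(problem, a, b) for _ in range(K))
            winners.append(a if wins_for_a > K / 2 else b)
        if bye is not None:
            winners.append(bye)
        candidates = winners
    return candidates[0]


def failure_bound(p_gen, p_comp, N, K):
    """The theorem's upper bound on P(final output is incorrect)."""
    return (1 - p_gen) ** N + math.ceil(math.log2(N)) * math.exp(
        -2 * K * (p_comp - 0.5) ** 2
    )


# Example: the bound decays exponentially in both N and K, e.g.
# failure_bound(0.2, 0.7, N=32, K=100) evaluates to roughly 2.5e-3.
```

Both the N generations in stage 1 and the K comparisons within each tournament match are independent LLM calls, which is what makes the algorithm highly parallelizable.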