Sleep-time Compute: Beyond Inference Scaling at Test-time
April 17, 2025
Authors: Kevin Lin, Charlie Snell, Yu Wang, Charles Packer, Sarah Wooders, Ion Stoica, Joseph E. Gonzalez
cs.AI
Abstract
Scaling test-time compute has emerged as a key ingredient for enabling large
language models (LLMs) to solve difficult problems, but it comes with high
latency and inference cost. We introduce sleep-time compute, which allows
models to "think" offline about contexts before queries are presented: by
anticipating what queries users might ask and pre-computing useful quantities,
we can significantly reduce the compute requirements at test-time. To
demonstrate the efficacy of our method, we create modified versions of two
reasoning tasks: Stateful GSM-Symbolic and Stateful AIME. We find that
sleep-time compute can reduce the amount of test-time compute needed to achieve
the same accuracy by about 5x on Stateful GSM-Symbolic and Stateful AIME, and
that by scaling sleep-time compute we can further increase accuracy by up to
13% on Stateful GSM-Symbolic and 18% on Stateful AIME. Furthermore, we
introduce Multi-Query GSM-Symbolic, which extends GSM-Symbolic by including
multiple related queries per context. By amortizing sleep-time compute across
related queries about the same context using Multi-Query GSM-Symbolic, we can
decrease the average cost per query by 2.5x. We then conduct additional
analysis to understand when sleep-time compute is most effective, finding that
the predictability of the user query is well correlated with the efficacy of
sleep-time compute. Finally, we conduct a case study of applying sleep-time
compute to a realistic agentic SWE task.
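
The amortization argument in the abstract can be illustrated with a small back-of-the-envelope sketch. This is not the paper's cost model: the function names, the equal weighting of sleep-time and test-time tokens, and all token counts below are hypothetical, chosen only to show why sharing one offline pass across several queries about the same context lowers the average cost per query.

```python
def amortized_cost_per_query(sleep_tokens, test_tokens_per_query, num_queries):
    """Average tokens per query when one sleep-time pass is shared by all queries.

    Assumption (not from the paper): sleep-time and test-time tokens are
    weighted equally, and the sleep-time pass is done once per context.
    """
    if num_queries < 1:
        raise ValueError("num_queries must be at least 1")
    return sleep_tokens / num_queries + test_tokens_per_query


def break_even_queries(sleep_tokens, baseline_tokens, test_tokens_per_query):
    """Number of queries at which sleep-time compute matches the baseline cost."""
    saving_per_query = baseline_tokens - test_tokens_per_query
    if saving_per_query <= 0:
        raise ValueError("sleep-time compute must reduce per-query test-time tokens")
    return sleep_tokens / saving_per_query


# Hypothetical budgets, for illustration only.
SLEEP_TOKENS = 1600      # one offline pass over the context
TEST_WITH_SLEEP = 200    # test-time tokens per query after the offline pass
BASELINE = 1000          # test-time tokens per query with no offline pass

for n in (1, 2, 8):
    cost = amortized_cost_per_query(SLEEP_TOKENS, TEST_WITH_SLEEP, n)
    print(f"{n} queries: {cost:.0f} tokens/query vs. baseline {BASELINE}")
```

Under these made-up numbers, a single query is more expensive with sleep-time compute than without it, and the offline pass only pays off once enough related queries share the same context.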