Sleep-time Compute: Beyond Inference Scaling at Test-time

Abstract

Scaling test-time compute has emerged as a key ingredient for enabling large language models (LLMs) to solve difficult problems, but it comes with high latency and inference cost. We introduce sleep-time compute, which allows models to "think" offline about contexts before queries are presented: by anticipating what queries users might ask and pre-computing useful quantities, we can significantly reduce the compute requirements at test time. To demonstrate the efficacy of our method, we create modified versions of two reasoning tasks: Stateful GSM-Symbolic and Stateful AIME. We find that sleep-time compute can reduce the amount of test-time compute needed to achieve the same accuracy by ~5x on Stateful GSM-Symbolic and Stateful AIME, and that by scaling sleep-time compute we can further increase accuracy by up to 13% on Stateful GSM-Symbolic and 18% on Stateful AIME. Furthermore, we introduce Multi-Query GSM-Symbolic, which extends GSM-Symbolic by including multiple related queries per context. By amortizing sleep-time compute across related queries about the same context using Multi-Query GSM-Symbolic, we can decrease the average cost per query by 2.5x. We then conduct additional analysis to understand when sleep-time compute is most effective, finding the predictability of the user query to be well correlated with the efficacy of sleep-time compute. Finally, we conduct a case study of applying sleep-time compute to a realistic agentic software engineering (SWE) task.
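
The amortization argument above can be illustrated with a minimal sketch. The cost model here is an assumption made for illustration only (cost counted as a flat number of tokens, with one sleep-time pass shared by every query on the same context); the function name and all numbers are hypothetical and are not taken from the paper's experiments.

```python
def average_cost_per_query(sleep_tokens: int,
                           test_tokens_per_query: int,
                           num_queries: int) -> float:
    """Average token cost per query when a single sleep-time pass over a
    context is shared by `num_queries` queries about that context.

    Assumed (hypothetical) cost model: total cost is the one-off sleep-time
    budget plus the test-time budget spent on each query.
    """
    if num_queries < 1:
        raise ValueError("num_queries must be at least 1")
    return sleep_tokens / num_queries + test_tokens_per_query


# Illustrative numbers only (not results from the paper).
for n in (1, 2, 5, 10):
    cost = average_cost_per_query(sleep_tokens=1000,
                                  test_tokens_per_query=100,
                                  num_queries=n)
    print(f"{n} queries per context: {cost} tokens per query on average")
```

Under this toy model the shared sleep-time budget is divided across queries, so the average cost per query falls as more related queries arrive for the same context and approaches the per-query test-time budget.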
