Sleep-time Compute: Beyond Inference Scaling at Test-time
April 17, 2025
Authors: Kevin Lin, Charlie Snell, Yu Wang, Charles Packer, Sarah Wooders, Ion Stoica, Joseph E. Gonzalez
cs.AI
Abstract
Scaling test-time compute has emerged as a key ingredient for enabling large
language models (LLMs) to solve difficult problems, but comes with high latency
and inference cost. We introduce sleep-time compute, which allows models to
"think" offline about contexts before queries are presented: by anticipating
what queries users might ask and pre-computing useful quantities, we can
significantly reduce the compute requirements at test-time. To demonstrate the
efficacy of our method, we create modified versions of two reasoning tasks:
Stateful GSM-Symbolic and Stateful AIME. We find that sleep-time compute can
reduce the amount of test-time compute needed to achieve the same accuracy by
roughly 5x on Stateful GSM-Symbolic and Stateful AIME, and that by scaling sleep-time
compute we can further increase accuracy by up to 13% on Stateful GSM-Symbolic
and 18% on Stateful AIME. Furthermore, we introduce Multi-Query GSM-Symbolic,
which extends GSM-Symbolic by including multiple related queries per context.
By amortizing sleep-time compute across related queries about the same context
using Multi-Query GSM-Symbolic, we can decrease the average cost per query by
2.5x. We then conduct additional analysis to understand when sleep-time compute
is most effective, finding the predictability of the user query to be well
correlated with the efficacy of sleep-time compute. Finally, we conduct a
case study of applying sleep-time compute to a realistic agentic SWE task.
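
The amortization argument in the abstract can be illustrated with a small piece of arithmetic. The sketch below is only a toy cost model, not the paper's evaluation setup: the function name and all numeric values are hypothetical, and it assumes that one sleep-time pass over a context is paid once and then shared by every query about that context.

```python
def average_cost_per_query(sleep_cost, test_cost_per_query, num_queries):
    """Average cost per query when one sleep-time pass is shared by all queries.

    All arguments are in the same arbitrary cost unit (hypothetical values,
    not measurements from the paper).
    """
    if num_queries < 1:
        raise ValueError("num_queries must be at least 1")
    total = sleep_cost + num_queries * test_cost_per_query
    return total / num_queries


# Hypothetical numbers chosen only for illustration.
BASELINE_TEST_COST = 10  # test-time cost per query without sleep-time compute
SLEEP_COST = 30          # one-off cost of "thinking" about the context offline
REDUCED_TEST_COST = 2    # test-time cost per query after sleep-time compute

for n in (1, 5, 10):
    avg = average_cost_per_query(SLEEP_COST, REDUCED_TEST_COST, n)
    print(f"{n} queries: average cost {avg} vs baseline {BASELINE_TEST_COST}")
```

Under these assumed numbers, a single query costs more with sleep-time compute than without it, while the average falls below the baseline once enough queries share the same context. This is the intuition behind amortizing across related queries.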