Efficiently Serving LLM Reasoning Programs with Certaindex
December 30, 2024
Authors: Yichao Fu, Junda Chen, Siqi Zhu, Zheyu Fu, Zhongdongming Dai, Aurick Qiao, Hao Zhang
cs.AI
Abstract
The rapid evolution of large language models (LLMs) has unlocked their
capabilities in advanced reasoning tasks like mathematical problem-solving,
code generation, and legal analysis. Central to this progress are
inference-time reasoning algorithms, which refine outputs by exploring multiple
solution paths, at the cost of increasing compute demands and response
latencies. Existing serving systems fail to adapt to the scaling behaviors of
these algorithms or the varying difficulty of queries, leading to inefficient
resource use and unmet latency targets.
We present Dynasor, a system that optimizes inference-time compute for LLM
reasoning queries. Unlike traditional engines, Dynasor tracks and schedules
requests within reasoning queries and uses Certaindex, a proxy that measures
statistical reasoning progress based on model certainty, to guide compute
allocation dynamically. Dynasor co-adapts scheduling with reasoning progress:
it allocates more compute to hard queries, reduces compute for simpler ones,
and terminates unpromising queries early, balancing accuracy, latency, and
cost. On diverse datasets and algorithms, Dynasor reduces compute by up to 50%
in batch processing and sustains 3.3x higher query rates or 4.7x tighter
latency SLOs in online serving.
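The abstract does not give Certaindex's exact formula, but the idea it describes (measure certainty from the model's sampled outputs, then grow or cut the compute budget accordingly) can be illustrated with a minimal sketch. Here, agreement among self-consistency samples stands in as a hypothetical certainty proxy; `sample_fn`, the batch size, and the threshold are all illustrative assumptions, not the paper's actual interface.

```python
from collections import Counter


def certainty(answers):
    """Hypothetical certainty proxy: fraction of sampled answers
    that agree with the modal (most common) answer."""
    counts = Counter(answers)
    return counts.most_common(1)[0][1] / len(answers)


def adaptive_solve(sample_fn, max_samples=16, batch=4, threshold=0.75):
    """Sample reasoning paths in rounds, stopping early once the
    certainty proxy crosses the threshold. Easy queries (high
    agreement) terminate with few samples; hard queries (low
    agreement) keep receiving compute up to max_samples."""
    answers = []
    while len(answers) < max_samples:
        answers.extend(sample_fn() for _ in range(batch))
        if certainty(answers) >= threshold:
            break
    best_answer = Counter(answers).most_common(1)[0][0]
    return best_answer, len(answers)


# Example: a query the model is certain about stops after one round.
answer, used = adaptive_solve(lambda: "42")
print(answer, used)  # → 42 4
```

In this sketch, a confident query consumes only one batch of samples while an ambiguous one would run to `max_samples`, mirroring the paper's claim of shifting compute from simple queries toward hard ones and terminating unpromising ones early.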