Efficiently Serving LLM Reasoning Programs with Certaindex
December 30, 2024
Authors: Yichao Fu, Junda Chen, Siqi Zhu, Zheyu Fu, Zhongdongming Dai, Aurick Qiao, Hao Zhang
cs.AI
Abstract
The rapid evolution of large language models (LLMs) has unlocked their
capabilities in advanced reasoning tasks like mathematical problem-solving,
code generation, and legal analysis. Central to this progress are
inference-time reasoning algorithms, which refine outputs by exploring multiple
solution paths, at the cost of increasing compute demands and response
latencies. Existing serving systems fail to adapt to the scaling behaviors of
these algorithms or the varying difficulty of queries, leading to inefficient
resource use and unmet latency targets.
We present Dynasor, a system that optimizes inference-time compute for LLM
reasoning queries. Unlike traditional engines, Dynasor tracks and schedules
requests within reasoning queries and uses Certaindex, a proxy that measures
statistical reasoning progress based on model certainty, to guide compute
allocation dynamically. Dynasor co-adapts scheduling with reasoning progress:
it allocates more compute to hard queries, reduces compute for simpler ones,
and terminates unpromising queries early, balancing accuracy, latency, and
cost. On diverse datasets and algorithms, Dynasor reduces compute by up to 50%
in batch processing and sustains 3.3x higher query rates or 4.7x tighter
latency SLOs in online serving.
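The abstract does not give Certaindex's exact formula, but the idea it describes (measure certainty from the model's sampled outputs, then grow or cut the compute budget accordingly) can be illustrated with a minimal sketch. Here, agreement among self-consistency samples stands in as a hypothetical certainty proxy; `sample_fn`, the batch size, and the threshold are all illustrative assumptions, not the paper's actual interface.

```python
from collections import Counter


def certainty(answers):
    """Hypothetical certainty proxy: fraction of sampled answers
    that agree with the modal (most common) answer."""
    counts = Counter(answers)
    return counts.most_common(1)[0][1] / len(answers)


def adaptive_solve(sample_fn, max_samples=16, batch=4, threshold=0.75):
    """Sample reasoning paths in rounds, stopping early once the
    certainty proxy crosses the threshold. Easy queries (high
    agreement) terminate with few samples; hard queries (low
    agreement) keep receiving compute up to max_samples."""
    answers = []
    while len(answers) < max_samples:
        answers.extend(sample_fn() for _ in range(batch))
        if certainty(answers) >= threshold:
            break
    best_answer = Counter(answers).most_common(1)[0][0]
    return best_answer, len(answers)


# Example: a query the model is certain about stops after one round.
answer, used = adaptive_solve(lambda: "42")
print(answer, used)  # → 42 4
```

In this sketch, a confident query consumes only one batch of samples while an ambiguous one would run to `max_samples`, mirroring the paper's claim of shifting compute from simple queries toward hard ones and terminating unpromising ones early.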