Efficiently Serving LLM Reasoning Programs with Certaindex
December 30, 2024
Authors: Yichao Fu, Junda Chen, Siqi Zhu, Zheyu Fu, Zhongdongming Dai, Aurick Qiao, Hao Zhang
cs.AI
Abstract
The rapid evolution of large language models (LLMs) has unlocked their
capabilities in advanced reasoning tasks like mathematical problem-solving,
code generation, and legal analysis. Central to this progress are
inference-time reasoning algorithms, which refine outputs by exploring multiple
solution paths, at the cost of increasing compute demands and response
latencies. Existing serving systems fail to adapt to the scaling behaviors of
these algorithms or the varying difficulty of queries, leading to inefficient
resource use and unmet latency targets.
We present Dynasor, a system that optimizes inference-time compute for LLM
reasoning queries. Unlike traditional engines, Dynasor tracks and schedules
requests within reasoning queries and uses Certaindex, a proxy that measures
statistical reasoning progress based on model certainty, to guide compute
allocation dynamically. Dynasor co-adapts scheduling with reasoning progress:
it allocates more compute to hard queries, reduces compute for simpler ones,
and terminates unpromising queries early, balancing accuracy, latency, and
cost. On diverse datasets and algorithms, Dynasor reduces compute by up to 50%
in batch processing and sustains 3.3x higher query rates or 4.7x tighter
latency SLOs in online serving.
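The abstract describes Certaindex as a certainty-based proxy for reasoning progress that guides dynamic compute allocation: grow compute for hard queries, stop early for easy ones, and terminate unpromising ones. The sketch below illustrates that control loop under stated assumptions — the paper does not specify this implementation, and the `certaindex` definition (modal-answer agreement across sampled solution paths, in the spirit of self-consistency), the function names, and all thresholds are hypothetical.

```python
# Hedged sketch of certainty-guided compute allocation.
# Assumptions (not from the paper): `certaindex` here is the empirical
# agreement among sampled answers; step size, budget, and thresholds
# are illustrative placeholders.
from collections import Counter

def certaindex(answers):
    """Proxy for reasoning progress: fraction of samples agreeing with
    the modal answer. Higher values mean the model is more certain."""
    if not answers:
        return 0.0
    counts = Counter(answers)
    return counts.most_common(1)[0][1] / len(answers)

def allocate(sample_fn, step=4, budget=32, threshold=0.75, min_progress=0.25):
    """Grow compute in increments of `step` samples; stop early once
    certainty crosses `threshold` (easy query), and terminate queries
    that show little agreement after half the budget (unpromising)."""
    answers = []
    while len(answers) < budget:
        answers.extend(sample_fn() for _ in range(step))
        c = certaindex(answers)
        if c >= threshold:                  # easy query: finish early
            break
        if len(answers) >= budget // 2 and c < min_progress:
            return None                     # unpromising: give up early
    return Counter(answers).most_common(1)[0][0]
```

In this toy form, a query whose sampled paths quickly agree consumes only one step of compute, while a query whose samples never converge is cut off at half the budget — the accuracy/latency/cost trade-off the abstract attributes to Dynasor's scheduler.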