Autellix: An Efficient Serving Engine for LLM Agents as General Programs
February 19, 2025
Authors: Michael Luo, Xiaoxiang Shi, Colin Cai, Tianjun Zhang, Justin Wong, Yichuan Wang, Chi Wang, Yanping Huang, Zhifeng Chen, Joseph E. Gonzalez, Ion Stoica
cs.AI
Abstract
Large language model (LLM) applications are evolving beyond simple chatbots
into dynamic, general-purpose agentic programs, which scale LLM calls and
output tokens to help AI agents reason, explore, and solve complex tasks.
However, existing LLM serving systems ignore dependencies between programs and
calls, missing significant opportunities for optimization. Our analysis reveals
that programs submitted to LLM serving engines experience long cumulative wait
times, primarily due to head-of-line blocking at both the individual LLM
request level and the program level. To address this, we introduce Autellix, an LLM serving
system that treats programs as first-class citizens to minimize their
end-to-end latencies. Autellix intercepts LLM calls submitted by programs,
enriching schedulers with program-level context. We propose two scheduling
algorithms, one for single-threaded and one for distributed programs, that preempt and
prioritize LLM calls based on their programs' previously completed calls. Our
evaluation demonstrates that across diverse LLMs and agentic workloads,
Autellix improves throughput of programs by 4-15x at the same latency compared
to state-of-the-art systems, such as vLLM.
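The abstract's central scheduling idea, prioritizing a program's pending LLM calls by how much service the program has already received from its completed calls, can be illustrated with a short sketch. The Python below is a minimal, hypothetical illustration of program-level least-attained-service scheduling; it is not Autellix's actual implementation, and all class, method, and field names (Program, PendingCall, ProgramAwareScheduler, attained_service) are invented for this example.

```python
import heapq
import itertools
from dataclasses import dataclass, field


@dataclass
class Program:
    """A multi-call agentic program (hypothetical, for illustration only)."""
    program_id: str
    attained_service: float = 0.0  # seconds of service from completed LLM calls


@dataclass(order=True)
class PendingCall:
    """An LLM call waiting in the engine's queue."""
    priority: float                         # program's attained service at enqueue
    seq: int                                # FIFO tie-breaker within a priority level
    call_id: str = field(compare=False)
    program: Program = field(compare=False)


class ProgramAwareScheduler:
    """Least-attained-service scheduling at program granularity."""

    def __init__(self) -> None:
        self._queue: list[PendingCall] = []
        self._counter = itertools.count()

    def submit(self, program: Program, call_id: str) -> None:
        # The call inherits its priority from the program's history of
        # completed calls, not from anything about the individual request.
        heapq.heappush(self._queue, PendingCall(
            program.attained_service, next(self._counter), call_id, program))

    def next_call(self) -> PendingCall | None:
        # Serve the call whose program has received the least service so far,
        # so short programs are not blocked behind long-running ones.
        return heapq.heappop(self._queue) if self._queue else None

    def complete(self, call: PendingCall, service_seconds: float) -> None:
        # Finished calls raise the program's attained service, pushing its
        # future calls behind programs that have consumed less.
        call.program.attained_service += service_seconds


# Usage: program "a" finishes one call, so a fresh program "b" jumps ahead.
sched = ProgramAwareScheduler()
a, b = Program("agent-a"), Program("agent-b")
sched.submit(a, "a-call-1")
done = sched.next_call()
sched.complete(done, service_seconds=3.0)
sched.submit(a, "a-call-2")
sched.submit(b, "b-call-1")
assert sched.next_call().call_id == "b-call-1"  # b has less attained service
```

In this sketch, a program that has already consumed substantial compute sorts behind newly arrived programs, which mitigates program-level head-of-line blocking; the paper's actual algorithms additionally handle preemption of running calls and distributed (multi-threaded) programs.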