

DeepFlow: Serverless Large Language Model Serving at Scale

January 24, 2025
Authors: Junhao Hu, Jiang Xu, Zhixia Liu, Yulong He, Yuetao Chen, Hao Xu, Jiang Liu, Baoquan Zhang, Shining Wan, Gengyuan Dan, Zhiyu Dong, Zhihao Ren, Jie Meng, Chao He, Changhong Liu, Tao Xie, Dayun Lin, Qin Zhang, Yue Yu, Hao Feng, Xusheng Chen, Yizhou Shan
cs.AI

Abstract

This paper introduces DeepFlow, a scalable and serverless AI platform designed to efficiently serve large language models (LLMs) at scale in cloud environments. DeepFlow addresses key challenges such as resource allocation, serving efficiency, and cold start latency through four main design components. First, it uses a simple serverless abstraction called the request-job-task model, which helps manage AI workloads across post-training and model serving tasks. Second, it builds an in-house serving engine, FlowServe, using a microkernel-inspired design, NPU-centric execution, and SPMD-based parallelism to optimize LLM serving. Third, it includes novel scheduling policies tailored for both PD-disaggregated and PD-colocated configurations. Fourth, with optimizations such as pre-warmed pods, DRAM pre-loading, and NPU-fork, DeepFlow can scale up to 64 instances in seconds. DeepFlow has been in production for over a year, operating on a large Ascend NPU cluster and providing industry-standard APIs for fine-tuning, agent serving, and model serving to our customers.
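The request-job-task model is only named in the abstract, not specified. The sketch below is one plausible reading of such a three-level decomposition, in which a user request fans out into jobs (here, prefill and decode stages, as in a PD-disaggregated setup) that are scheduled as per-NPU tasks. All class names, fields, and the prefill/decode split are hypothetical illustrations, not DeepFlow's actual API.

```python
# A minimal sketch (not DeepFlow's real API) of a request-job-task
# decomposition: one user request fans out into jobs, each of which is
# scheduled as per-NPU tasks. All names and the prefill/decode split
# are illustrative assumptions.
from dataclasses import dataclass, field
from enum import Enum, auto
from typing import List, Optional


class TaskState(Enum):
    PENDING = auto()
    RUNNING = auto()
    DONE = auto()


@dataclass
class Task:
    """Smallest schedulable unit, assumed here to be pinned to one NPU."""
    task_id: str
    npu_id: Optional[int] = None
    state: TaskState = TaskState.PENDING


@dataclass
class Job:
    """One stage of work, e.g. a prefill or decode stage in a
    PD-disaggregated deployment (hypothetical granularity)."""
    job_id: str
    tasks: List[Task] = field(default_factory=list)


@dataclass
class Request:
    """Top-level serverless unit: a single user call such as an
    inference or fine-tuning request."""
    request_id: str
    jobs: List[Job] = field(default_factory=list)

    def pending_tasks(self) -> List[Task]:
        return [t for j in self.jobs for t in j.tasks
                if t.state is TaskState.PENDING]


if __name__ == "__main__":
    # One inference request split into a prefill job and a decode job.
    req = Request(
        request_id="req-1",
        jobs=[
            Job("prefill", tasks=[Task("prefill-0")]),
            Job("decode", tasks=[Task("decode-0"), Task("decode-1")]),
        ],
    )
    print(len(req.pending_tasks()))  # -> 3
```

A scheduler built on this shape could dispatch pending tasks to free NPUs and treat the request as complete when every task reaches DONE; the paper's scheduling policies for PD-disaggregated and PD-colocated configurations would sit at that layer.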

