DeepFlow: Serverless Large Language Model Serving at Scale

Abstract

This paper introduces DeepFlow, a scalable, serverless AI platform designed to serve large language models (LLMs) efficiently at scale in cloud environments. DeepFlow addresses key challenges such as resource allocation, serving efficiency, and cold-start latency through four main design components. First, it uses a simple serverless abstraction, the request-job-task model, which helps manage AI workloads across post-training and model serving tasks. Second, it builds an in-house serving engine, FlowServe, that uses a microkernel-inspired design, NPU-centric execution, and SPMD-based parallelism to optimize LLM serving. Third, the system includes novel scheduling policies tailored for both PD-disaggregated and PD-colocated configurations. Finally, with optimizations such as pre-warmed pods, DRAM pre-loading, and NPU-fork, DeepFlow can scale up to 64 instances in seconds. DeepFlow has been in production for over a year, operating on a large Ascend NPU cluster and providing industry-standard APIs for fine-tuning, agent serving, and model serving to our customers.
