PRIMA.CPP: Speeding Up 70B-Scale LLM Inference on Low-Resource Everyday Home Clusters
April 7, 2025
Authors: Zonghang Li, Tao Li, Wenjiao Feng, Mohsen Guizani, Hongfang Yu
cs.AI
Abstract
The emergence of DeepSeek R1 and QwQ 32B has broken through performance barriers
for running frontier large language models (LLMs) on home devices. While
consumer hardware is getting stronger and model quantization is improving,
existing end-side solutions still demand GPU clusters, large RAM/VRAM, and high
bandwidth, far beyond what a common home cluster can handle. This paper
introduces prima.cpp, a distributed inference system that runs 70B-scale models
on everyday home devices using a mix of CPU/GPU, low RAM/VRAM, Wi-Fi, and
cross-platform support. It uses mmap to manage model weights and introduces
piped-ring parallelism with prefetching to hide disk loading. By modeling
heterogeneity in computation, communication, disk, memory (and its management
behavior), and OS, it optimally assigns model layers to each device's CPU and
GPU, further reducing token latency. An elegant algorithm named Halda is
proposed to solve this NP-hard assignment problem. We evaluate prima.cpp on a
common four-node home cluster. It outperforms llama.cpp, exo, and dllama on
30B+ models while keeping memory pressure below 6%. This brings frontier
30B-70B models, such as Llama 3, DeepSeek R1, Qwen 2.5, and QwQ to home
assistants, making advanced AI truly accessible to individuals. The code is
open source and available at https://github.com/Lizonghang/prima.cpp.