

PRIMA.CPP: Speeding Up 70B-Scale LLM Inference on Low-Resource Everyday Home Clusters

April 7, 2025
作者: Zonghang Li, Tao Li, Wenjiao Feng, Mohsen Guizani, Hongfang Yu
cs.AI

Abstract

The emergence of DeepSeek R1 and QwQ 32B has broken through performance barriers for running frontier large language models (LLMs) on home devices. While consumer hardware is getting stronger and model quantization is improving, existing end-side solutions still demand GPU clusters, large RAM/VRAM, and high bandwidth, far beyond what a common home cluster can handle. This paper introduces prima.cpp, a distributed inference system that runs 70B-scale models on everyday home devices using a mix of CPU/GPU compute, low RAM/VRAM, Wi-Fi, and cross-platform support. It uses mmap to manage model weights and introduces piped-ring parallelism with prefetching to hide disk loading. By modeling heterogeneity in computation, communication, disk, memory (and its management behavior), and OS, it optimally assigns model layers to each device's CPU and GPU, further reducing token latency. An elegant algorithm named Halda is proposed to solve this NP-hard assignment problem. We evaluate prima.cpp on a common four-node home cluster. It outperforms llama.cpp, exo, and dllama on 30B+ models while keeping memory pressure below 6%. This brings frontier 30B-70B models, such as Llama 3, DeepSeek R1, Qwen 2.5, and QwQ, to home assistants, making advanced AI truly accessible to individuals. The code is open source and available at https://github.com/Lizonghang/prima.cpp.


April 15, 2025