PRIMA.CPP: Speeding Up 70B-Scale LLM Inference on Low-Resource Everyday Home Clusters
April 7, 2025
Authors: Zonghang Li, Tao Li, Wenjiao Feng, Mohsen Guizani, Hongfang Yu
cs.AI
Abstract
The emergence of DeepSeek R1 and QwQ 32B has broken through performance barriers
for running frontier large language models (LLMs) on home devices. While
consumer hardware is getting stronger and model quantization is improving,
existing end-side solutions still demand GPU clusters, large RAM/VRAM, and high
bandwidth, far beyond what a common home cluster can handle. This paper
introduces prima.cpp, a distributed inference system that runs 70B-scale models
on everyday home devices using a mix of CPU/GPU, low RAM/VRAM, Wi-Fi, and
cross-platform support. It uses mmap to manage model weights and introduces
piped-ring parallelism with prefetching to hide disk loading. By modeling
heterogeneity in computation, communication, disk, memory (and its management
behavior), and OS, it optimally assigns model layers to each device's CPU and
GPU, further reducing token latency. An elegant algorithm named Halda is
proposed to solve this NP-hard assignment problem. We evaluate prima.cpp on a
common four-node home cluster. It outperforms llama.cpp, exo, and dllama on
30B+ models while keeping memory pressure below 6%. This brings frontier
30B-70B models, such as Llama 3, DeepSeek R1, Qwen 2.5, and QwQ to home
assistants, making advanced AI truly accessible to individuals. The code is
open source and available at https://github.com/Lizonghang/prima.cpp.