TPI-LLM: Serving 70B-scale LLMs Efficiently on Low-resource Edge Devices
October 1, 2024
作者: Zonghang Li, Wenjiao Feng, Mohsen Guizani, Hongfang Yu
cs.AI
Abstract
Large model inference is shifting from cloud to edge due to concerns about
the privacy of user interaction data. However, edge devices often struggle with
limited computing power, memory, and bandwidth, requiring collaboration across
multiple devices to run and speed up LLM inference. Pipeline parallelism, the
mainstream solution, is inefficient for single-user scenarios, while tensor
parallelism struggles with frequent communications. In this paper, we argue
that tensor parallelism can be more effective than pipeline on low-resource
devices, and present a compute- and memory-efficient tensor parallel inference
system, named TPI-LLM, to serve 70B-scale models. TPI-LLM keeps sensitive raw
data local in the users' devices and introduces a sliding window memory
scheduler to dynamically manage layer weights during inference, with disk I/O
latency overlapped with the computation and communication. This allows larger
models to run smoothly on memory-limited devices. We analyze the communication
bottleneck and find that link latency, not bandwidth, emerges as the main
issue, so a star-based allreduce algorithm is implemented. Through extensive
experiments on both emulated and real testbeds, TPI-LLM demonstrated over 80%
less time-to-first-token and token latency compared to Accelerate, and over 90%
compared to Transformers and Galaxy, while cutting the peak memory footprint of
Llama 2-70B by 90%, requiring only 3.1 GB of memory for 70B-scale models.
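
To make the sliding-window idea concrete, below is a minimal, hypothetical sketch of how layer weights could be prefetched from disk in a background thread while the current layer computes, so that disk I/O overlaps with computation. This is not TPI-LLM's actual scheduler; `load_layer_weights` and `forward_layer` are placeholder names introduced for illustration.

```python
# Illustrative sketch of a sliding-window weight scheduler (not TPI-LLM's code).
# Idea: keep only `window` layers resident in memory; while layer i computes,
# background threads prefetch upcoming layers from disk so I/O overlaps compute.
import threading
from collections import OrderedDict

class SlidingWindowScheduler:
    def __init__(self, num_layers, window, load_layer_weights):
        self.num_layers = num_layers
        self.window = window
        self.load = load_layer_weights      # disk -> memory loader (placeholder)
        self.cache = OrderedDict()          # layer_id -> weights currently in RAM
        self.events = {}                    # layer_id -> "weights ready" event

    def _prefetch(self, layer_id):
        self.cache[layer_id] = self.load(layer_id)
        self.events[layer_id].set()

    def prefetch_async(self, layer_id):
        if layer_id >= self.num_layers or layer_id in self.events:
            return
        self.events[layer_id] = threading.Event()
        threading.Thread(target=self._prefetch, args=(layer_id,), daemon=True).start()

    def acquire(self, layer_id):
        # Ensure the whole window [layer_id, layer_id + window) is being loaded,
        # block only on the layer needed right now, and evict the layer that
        # just slid out of the window to bound peak memory.
        for lid in range(layer_id, layer_id + self.window):
            self.prefetch_async(lid)
        self.events[layer_id].wait()
        evicted = layer_id - 1
        if evicted in self.cache:
            del self.cache[evicted]
            del self.events[evicted]
        return self.cache[layer_id]

# Hypothetical usage during a forward pass:
#   sched = SlidingWindowScheduler(num_layers=80, window=2,
#                                  load_layer_weights=load_from_disk)
#   for i in range(80):
#       weights = sched.acquire(i)
#       hidden = forward_layer(weights, hidden)  # compute overlaps prefetch of i+1
```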
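
Likewise, the star-based allreduce can be sketched as follows: every worker exchanges its partial result with a single hub node, so each reduction costs a constant two link traversals regardless of the number of devices, which is what matters when per-message link latency rather than bandwidth dominates. The `send`/`recv` primitives below are assumed point-to-point transports (e.g. TCP sockets) and are not part of the paper.

```python
# Illustrative sketch of a star-based allreduce (element-wise sum) over N workers.
# Compared with ring allreduce, the number of sequential link traversals per
# worker is constant (one hop to the hub, one hop back) instead of growing with
# the number of devices, which helps on high-latency edge links.
import numpy as np

def star_allreduce(tensor, rank, world_size, hub_rank, send, recv):
    """Return the element-wise sum of `tensor` across all ranks."""
    if rank == hub_rank:
        total = tensor.copy()
        for peer in range(world_size):      # gather partial results at the hub
            if peer != hub_rank:
                total += recv(peer)
        for peer in range(world_size):      # broadcast the reduced tensor back
            if peer != hub_rank:
                send(peer, total)
        return total
    else:
        send(hub_rank, tensor)              # one hop to the hub ...
        return recv(hub_rank)               # ... and one hop back with the sum
```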