

PipeOffload: Improving Scalability of Pipeline Parallelism with Memory Optimization

March 3, 2025
Authors: Xinyi Wan, Penghui Qi, Guangxing Huang, Jialin Li, Min Lin
cs.AI

Abstract

Pipeline parallelism (PP) is widely used for training large language models (LLMs), yet its scalability is often constrained by high activation memory consumption, as the number of in-flight microbatches grows with the degree of PP. In this paper, we focus on addressing this challenge by leveraging the under-explored memory offload strategy in PP. Through empirical study, we discover that in the majority of standard configurations, at least half, and potentially all, of the activations can be offloaded with negligible overhead. In cases where full offload is not possible, we introduce a novel selective offload strategy that decreases peak activation memory in a better-than-linear manner. Furthermore, we integrate memory offload with other techniques to jointly consider overall throughput and memory limitations. Our experiments demonstrate that per-device activation memory effectively decreases with the total number of stages, making PP a stronger alternative than tensor parallelism (TP), offering up to a 19% acceleration with even lower memory consumption. The implementation is open-sourced at https://github.com/sail-sg/zero-bubble-pipeline-parallelism.
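The intuition behind offloading in PP can be illustrated with a minimal, hypothetical model of per-stage peak activation memory under a 1F1B schedule. Everything here (function name, the uniform-activation-size assumption, the simple stage-depth formula) is our own sketch for illustration, not the paper's actual method; in particular, the paper's selective offload strategy achieves a better-than-linear reduction by choosing *which* activations to offload, which this uniform model does not capture:

```python
def peak_activation_memory(num_stages, microbatch_act_mem=1.0, offload_fraction=0.0):
    """Estimate per-stage peak activation memory in a 1F1B pipeline.

    Under 1F1B, stage s (0-indexed) holds up to (num_stages - s)
    in-flight microbatches' activations, so earlier stages bear the
    highest memory pressure. Offloading a fraction of each microbatch's
    activations to host memory lowers the resident peak proportionally.
    Illustrative model only: assumes uniform activation size per stage.
    """
    resident = microbatch_act_mem * (1.0 - offload_fraction)
    return [(num_stages - s) * resident for s in range(num_stages)]

# Without offload, stage 0 of an 8-stage pipeline keeps 8 microbatches
# of activations resident; offloading half of them halves that peak.
print(peak_activation_memory(8)[0])                        # 8.0
print(peak_activation_memory(8, offload_fraction=0.5)[0])  # 4.0
```

This simple model shows why the first stage dominates peak memory and why offloading makes deeper pipelines viable: the resident peak no longer needs to scale with the full pipeline depth.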

