Balancing Pipeline Parallelism with Vocabulary Parallelism
November 8, 2024
Authors: Man Tsung Yeung, Penghui Qi, Min Lin, Xinyi Wan
cs.AI
Abstract
Pipeline parallelism is widely used to scale the training of
transformer-based large language models, and various works have sought to
improve its throughput and memory footprint. In this paper, we address a
frequently overlooked issue: the vocabulary layers can cause imbalanced
computation and memory usage across pipeline stages, worsening pipeline bubbles
and the memory bottleneck. To tackle this, we partition the vocabulary layers
evenly across pipeline devices and group the computation into pipeline passes.
To reduce the activation memory overhead, we propose several algorithms to
reduce communication barriers within vocabulary layers. Additionally, we
utilize a generalizable method to integrate Vocabulary Parallelism with
existing pipeline schedules. By combining these techniques, our methods
effectively balance the computation and parameter memory, with only a small
constant activation memory overhead. Notably, when combined with activation
memory-balanced schedules like V-Half, our approach achieves perfect balance in
both memory and computation. Extensive evaluations demonstrate that our method
achieves computation and memory balance regardless of the vocabulary size,
resulting in a 5% to 51% improvement in throughput compared to naive
approaches, while significantly reducing peak memory usage, especially in
large-vocabulary scenarios. Our implementation is open-sourced at
https://github.com/sail-sg/VocabularyParallelism.
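
To make the idea of partitioning the vocabulary layer concrete, the following is a minimal sketch, not the paper's algorithm or its released implementation. It assumes a single token, plain Python lists in place of tensors, and ordinary `max`/`sum` calls standing in for cross-device all-reduce operations. The function names (`split_vocab`, `local_stats`, `sharded_cross_entropy`) are hypothetical. It only illustrates that a softmax cross-entropy over an evenly split vocabulary can be assembled from per-shard maxima and sums; the communication-barrier reductions proposed in the paper are not modeled here.

```python
import math


def split_vocab(vocab_size, num_devices):
    """Return (start, end) ranges that split the vocabulary as evenly as possible."""
    if vocab_size < num_devices:
        raise ValueError("this sketch assumes at least one vocabulary entry per device")
    base, extra = divmod(vocab_size, num_devices)
    ranges, start = [], 0
    for rank in range(num_devices):
        # The first `extra` devices each take one additional entry.
        end = start + base + (1 if rank < extra else 0)
        ranges.append((start, end))
        start = end
    return ranges


def local_stats(logits_shard):
    """Per-device work: local max and sum of exponentials relative to that max."""
    local_max = max(logits_shard)
    local_sum = sum(math.exp(x - local_max) for x in logits_shard)
    return local_max, local_sum


def sharded_cross_entropy(logits, target, num_devices):
    """Cross-entropy loss for one token, assembled from per-shard statistics."""
    ranges = split_vocab(len(logits), num_devices)
    stats = [local_stats(logits[a:b]) for a, b in ranges]
    # The two reductions below stand in for all-reduce calls across devices.
    global_max = max(m for m, _ in stats)
    global_sum = sum(s * math.exp(m - global_max) for m, s in stats)
    # In a real setup only the shard owning `target` would supply this logit.
    return math.log(global_sum) + global_max - logits[target]


def reference_cross_entropy(logits, target):
    """Unsharded reference: log-sum-exp of all logits minus the target logit."""
    m = max(logits)
    return math.log(sum(math.exp(x - m) for x in logits)) + m - logits[target]
```

In this sketch each device touches only its own slice of the logits, and the only values exchanged are one maximum and one sum per shard, which is why the vocabulary layer can be spread across pipeline devices without any device holding the full output projection.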