

Streaming DiLoCo with overlapping communication: Towards a Distributed Free Lunch

January 30, 2025
Authors: Arthur Douillard, Yanislav Donchev, Keith Rush, Satyen Kale, Zachary Charles, Zachary Garrett, Gabriel Teston, Dave Lacey, Ross McIlroy, Jiajun Shen, Alexandre Ramé, Arthur Szlam, Marc'Aurelio Ranzato, Paul Barham
cs.AI

Abstract

Training of large language models (LLMs) is typically distributed across a large number of accelerators to reduce training time. Since internal states and parameter gradients need to be exchanged at every gradient step, all devices need to be co-located and connected by low-latency, high-bandwidth communication links to support the required high volume of exchanged bits. Recently, distributed algorithms like DiLoCo have relaxed this co-location constraint: accelerators can be grouped into "workers", and synchronizations between workers occur only infrequently. This in turn means that workers can afford to be connected by lower-bandwidth communication links without affecting learning quality. However, in these methods, communication across workers still requires the same peak bandwidth as before, because the synchronizations require all parameters to be exchanged across all workers. In this paper, we improve DiLoCo in three ways. First, we synchronize only subsets of parameters in sequence, rather than all at once, which greatly reduces peak bandwidth. Second, we allow workers to continue training while synchronizing, which decreases wall-clock time. Third, we quantize the data exchanged by workers, which further reduces inter-worker bandwidth. By properly combining these modifications, we show experimentally that we can distribute training of billion-scale models and reach similar quality as before, while reducing the required bandwidth by two orders of magnitude.
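
To make the three modifications concrete, below is a minimal, single-process sketch in Python/NumPy that simulates the idea: parameters are split into fragments that are synchronized one at a time, workers keep taking local steps while a fragment's snapshot is "in flight", and only quantized deltas are exchanged. All names (quantize, local_step, NUM_FRAGMENTS, etc.), the toy flat-vector model, and the plain delta-averaging outer step are illustrative assumptions, not the paper's actual implementation.

```python
# Toy single-process simulation of the three modifications described above.
# This is an illustrative sketch, not the paper's implementation: "communication"
# is only simulated by continuing local steps between snapshot and merge, and
# the outer update is plain averaging of quantized deltas.
import numpy as np

rng = np.random.default_rng(0)

NUM_WORKERS = 4
NUM_PARAMS = 1_000        # toy "model": a flat parameter vector
NUM_FRAGMENTS = 10        # parameters split into fragments, synced in sequence
INNER_STEPS = 20          # local steps taken while a fragment sync is in flight


def quantize(x, num_bits=8):
    """Crude uniform quantizer standing in for low-precision exchange."""
    scale = np.max(np.abs(x)) + 1e-12
    levels = 2 ** (num_bits - 1) - 1
    return np.round(x / scale * levels) * scale / levels


def local_step(params):
    """Placeholder for one inner optimization step on a worker's data shard."""
    return params - 0.01 * rng.normal(size=params.shape)


# Each worker holds its own full replica of the parameters.
workers = [rng.normal(size=NUM_PARAMS) for _ in range(NUM_WORKERS)]
global_params = np.mean(workers, axis=0)
fragments = np.array_split(np.arange(NUM_PARAMS), NUM_FRAGMENTS)

for frag in fragments:  # fragments are synchronized one at a time (streaming)
    # Snapshot only this fragment; peak bandwidth scales with the fragment
    # size rather than with the full model.
    snapshots = [w[frag].copy() for w in workers]

    # While the snapshot is (conceptually) being communicated, workers keep
    # training on all parameters -- the overlap of compute and communication.
    for _ in range(INNER_STEPS):
        workers = [local_step(w) for w in workers]

    # Workers exchange only quantized deltas of this fragment.
    deltas = [quantize(s - global_params[frag]) for s in snapshots]
    global_params[frag] += np.mean(deltas, axis=0)

    # The merged fragment is folded back into every replica.
    for w in workers:
        w[frag] = global_params[frag]

print("max per-parameter spread across workers:",
      np.max(np.std(np.stack(workers), axis=0)))
```

In DiLoCo itself, the averaged deltas feed an outer optimizer (SGD with Nesterov momentum) rather than being applied directly to the global parameters; the sketch omits that step for brevity.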

