

DeMo: Decoupled Momentum Optimization

November 29, 2024
Authors: Bowen Peng, Jeffrey Quesnelle, Diederik P. Kingma
cs.AI

Abstract

Training large neural networks typically requires sharing gradients between accelerators through specialized high-speed interconnects. Drawing from the signal processing principles of frequency decomposition and energy compaction, we demonstrate that synchronizing full optimizer states and model parameters during training is unnecessary. By decoupling momentum updates and allowing controlled divergence in optimizer states across accelerators, we achieve improved convergence compared to state-of-the-art optimizers. We introduce Decoupled Momentum (DeMo), a fused optimizer and data-parallel algorithm that reduces inter-accelerator communication requirements by several orders of magnitude. This enables training of large neural networks even with limited network bandwidth and heterogeneous hardware. Our method is topology-agnostic and architecture-independent, and it supports scalable clock-synchronous distributed training with negligible compute and memory overhead. Empirical results show that models trained with DeMo match or exceed the performance of equivalent models trained with AdamW, while eliminating the need for high-speed interconnects when pre-training large-scale foundation models. An open-source reference PyTorch implementation is published on GitHub at https://github.com/bloc97/DeMo.
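
To make the idea concrete, below is a minimal PyTorch sketch of a single decoupled-momentum step. It is an illustration under stated assumptions, not the reference implementation: the function name `demo_style_step` and its hyperparameters are invented for this example, a simple top-k magnitude mask stands in for the paper's DCT-based frequency decomposition and energy compaction, and a practical version would transmit only the sparse selected values and indices rather than all-reducing a dense tensor.

```python
import torch
import torch.distributed as dist

def demo_style_step(param, grad, momentum, lr=1e-3, beta=0.99, k_frac=0.01):
    """Illustrative single step of decoupled momentum optimization (sketch).

    The local momentum buffer is never synchronized in full; only its
    largest-magnitude ("fast") components are extracted, shared across
    workers, and applied, while the remainder is allowed to diverge.
    """
    # 1. Accumulate the local gradient into the local momentum buffer.
    momentum.mul_(beta).add_(grad)

    # 2. Extract the fast components. A top-k magnitude mask is used here
    #    as a stand-in for the DCT-based energy compaction in the paper.
    flat = momentum.flatten()
    k = max(1, int(k_frac * flat.numel()))
    idx = flat.abs().topk(k).indices
    fast = torch.zeros_like(flat)
    fast[idx] = flat[idx]
    fast = fast.view_as(momentum)

    # 3. Decouple: remove the transmitted components from the local momentum,
    #    so optimizer states are allowed to diverge across accelerators.
    momentum.sub_(fast)

    # 4. Synchronize only the extracted components (dense all-reduce here for
    #    brevity; a real implementation would send the sparse values/indices).
    if dist.is_available() and dist.is_initialized():
        dist.all_reduce(fast, op=dist.ReduceOp.SUM)

    # 5. Apply a sign-based update from the synchronized components.
    param.data.add_(fast.sign(), alpha=-lr)
```

As a usage sketch, calling `demo_style_step(p, p.grad, state_buffer)` once per parameter tensor after a backward pass would take the place of both the optimizer step and the usual dense gradient all-reduce in a data-parallel training loop.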
