DeMo: Decoupled Momentum Optimization

November 29, 2024
Authors: Bowen Peng, Jeffrey Quesnelle, Diederik P. Kingma
cs.AI

Abstract

Training large neural networks typically requires sharing gradients between accelerators through specialized high-speed interconnects. Drawing from the signal processing principles of frequency decomposition and energy compaction, we demonstrate that synchronizing full optimizer states and model parameters during training is unnecessary. By decoupling momentum updates and allowing controlled divergence in optimizer states across accelerators, we achieve improved convergence compared to state-of-the-art optimizers. We introduce Decoupled Momentum (DeMo), a fused optimizer and data parallel algorithm that reduces inter-accelerator communication requirements by several orders of magnitude. This enables training of large neural networks even with limited network bandwidth and heterogeneous hardware. Our method is topology-agnostic and architecture-independent and supports scalable clock-synchronous distributed training with negligible compute and memory overhead. Empirical results show that models trained with DeMo match or exceed the performance of equivalent models trained with AdamW, while eliminating the need for high-speed interconnects when pre-training large-scale foundation models. An open source reference PyTorch implementation is published on GitHub at https://github.com/bloc97/DeMo
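
The abstract describes the mechanism only at a high level. The sketch below illustrates, for a single parameter tensor per rank, one way the decoupled-momentum idea could look in PyTorch; it is not the authors' reference implementation (see the GitHub link above). The use of `torch.fft.rfft` as the frequency decomposition, the helper names `extract_fast_components` and `demo_like_step`, and the hyperparameters `k`, `beta`, and `lr` are all illustrative assumptions.

```python
# A minimal, hypothetical sketch of the decoupled-momentum idea: keep a local
# momentum buffer per accelerator, extract only its highest-energy frequency
# components, synchronize those few components, and let the residual momentum
# diverge locally. Not the paper's reference implementation.
import torch
import torch.distributed as dist


def extract_fast_components(m: torch.Tensor, k: int) -> torch.Tensor:
    """Energy compaction: keep the k largest-magnitude frequency coefficients
    of the flattened momentum and transform back to parameter space."""
    coeffs = torch.fft.rfft(m.flatten())                      # frequency decomposition
    keep = coeffs.abs().topk(min(k, coeffs.numel())).indices  # highest-energy bins
    mask = torch.zeros_like(coeffs)
    mask[keep] = 1.0
    return torch.fft.irfft(coeffs * mask, n=m.numel()).view_as(m)


@torch.no_grad()
def demo_like_step(param, grad, momentum, lr=3e-4, beta=0.999, k=32):
    """One illustrative update on each rank (hyperparameters are placeholders)."""
    momentum.mul_(beta).add_(grad)             # accumulate local gradient into momentum
    q = extract_fast_components(momentum, k)   # few high-energy components to share
    momentum.sub_(q)                           # decouple: residual momentum stays local
    if dist.is_initialized():
        # In practice only the sparse coefficients would be exchanged;
        # a dense all-reduce is used here purely for brevity.
        dist.all_reduce(q, op=dist.ReduceOp.SUM)
        q.div_(dist.get_world_size())
    param.add_(q.sign(), alpha=-lr)            # sign-based update (one plausible choice)
```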

