DeMo: Decoupled Momentum Optimization
November 29, 2024
Authors: Bowen Peng, Jeffrey Quesnelle, Diederik P. Kingma
cs.AI
Abstract
Training large neural networks typically requires sharing gradients between
accelerators through specialized high-speed interconnects. Drawing from the
signal processing principles of frequency decomposition and energy compaction,
we demonstrate that synchronizing full optimizer states and model parameters
during training is unnecessary. By decoupling momentum updates and allowing
controlled divergence in optimizer states across accelerators, we achieve
improved convergence compared to state-of-the-art optimizers. We introduce
Decoupled Momentum (DeMo), a fused optimizer and data
parallel algorithm that reduces inter-accelerator communication requirements by
several orders of magnitude. This enables training of large neural networks
even with limited network bandwidth and heterogeneous hardware. Our method is
topology-agnostic and architecture-independent and supports scalable
clock-synchronous distributed training with negligible compute and memory
overhead. Empirical results show that models trained with DeMo match or exceed
the performance of equivalent models trained with AdamW, while eliminating the
need for high-speed interconnects when pre-training large-scale foundation
models. An open source reference PyTorch implementation is published on GitHub
at https://github.com/bloc97/DeMo
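
The abstract only describes the mechanism at a high level. Below is a minimal, single-worker PyTorch sketch of how frequency decomposition and energy compaction could decouple a momentum update, written under our own assumptions: the chunk size, the orthonormal DCT-II projection, the top-k component count (k_components), the sign-based parameter step, and the helper names (dct_matrix, decoupled_momentum_step) are illustrative choices, not the authors' reference implementation (see the GitHub link above for that). The idea shown: gradients accumulate into a local momentum buffer, only the few highest-energy frequency components are extracted for synchronization, and the residual stays local, so optimizer states across workers are allowed to diverge in a controlled way.

import math
import torch


def dct_matrix(n: int) -> torch.Tensor:
    # Orthonormal DCT-II basis as an n x n matrix (PyTorch has no built-in DCT).
    k = torch.arange(n, dtype=torch.float32).unsqueeze(1)
    i = torch.arange(n, dtype=torch.float32).unsqueeze(0)
    basis = torch.cos(math.pi / n * (i + 0.5) * k)
    basis[0] *= 1.0 / math.sqrt(2.0)
    return basis * math.sqrt(2.0 / n)


def decoupled_momentum_step(grad, momentum, dct, idct, k_components=4, beta=0.9):
    # One conceptual step for a single 1-D parameter chunk (illustrative only).
    # 1) Accumulate the raw gradient into the *local* momentum buffer.
    momentum.mul_(beta).add_(grad)

    # 2) Frequency decomposition: project the momentum onto the DCT basis.
    freq = dct @ momentum

    # 3) Energy compaction: keep only the k highest-magnitude components.
    top = torch.topk(freq.abs(), k_components)
    mask = torch.zeros_like(freq)
    mask[top.indices] = 1.0
    shared_freq = freq * mask  # the small tensor that would be exchanged between workers

    # 4) Subtract the shared part from the local momentum; the residual stays
    #    local and carries into later steps ("controlled divergence").
    shared = idct @ shared_freq
    momentum.sub_(shared)

    # With real data parallelism, shared_freq (indices + values) from every
    # worker would be gathered and summed here; this sketch has one worker.
    return shared


# Toy usage on a single 64-element parameter chunk.
torch.manual_seed(0)
n = 64
dct = dct_matrix(n)
idct = dct.t()  # orthonormal basis, so the inverse transform is the transpose
param = torch.randn(n)
momentum = torch.zeros(n)
lr = 0.1
for step in range(5):
    grad = torch.randn(n)  # stand-in for a real gradient
    update = decoupled_momentum_step(grad, momentum, dct, idct)
    param -= lr * torch.sign(update)  # sign-based update, an assumed design choice
print("toy run done; param norm =", float(param.norm()))

In this sketch, only the k selected frequency components per chunk would need to cross the network, which is how the described approach can cut inter-accelerator communication by orders of magnitude relative to synchronizing full gradients.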