Communication-Efficient Language Model Training Scales Reliably and Robustly: Scaling Laws for DiLoCo

March 12, 2025
Authors: Zachary Charles, Gabriel Teston, Lucio Dery, Keith Rush, Nova Fallen, Zachary Garrett, Arthur Szlam, Arthur Douillard
cs.AI

Abstract

As we scale to ever more massive machine learning models, the frequent synchronization demands inherent in data-parallel approaches create significant slowdowns, posing a critical challenge to further scaling. Recent work developed an approach, DiLoCo, that relaxes synchronization demands without compromising model quality. However, that work did not carefully analyze how DiLoCo's behavior changes with model size. In this work, we study the scaling-law behavior of DiLoCo when training LLMs under a fixed compute budget. We focus on how algorithmic factors, including the number of model replicas, hyperparameters, and token budget, affect training in ways that can be accurately predicted via scaling laws. We find that DiLoCo scales both predictably and robustly with model size. When well-tuned, DiLoCo scales better with model size than data-parallel training, and can outperform data-parallel training even at small model sizes. Our results showcase a more general set of benefits of DiLoCo than previously documented, including increased optimal batch sizes, improved downstream generalization with scale, and improved evaluation loss for a fixed token budget.
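
The abstract only names DiLoCo, so the sketch below illustrates the commonly described DiLoCo-style training pattern: each of M replicas takes H local AdamW steps from the shared global parameters, and an outer Nesterov-momentum step is then applied to the averaged parameter delta, so synchronization happens only once per outer round. Function names, hyperparameter defaults, and the data-loading convention here are illustrative assumptions, not the authors' implementation.

```python
import copy
import itertools

import torch
import torch.nn.functional as F


def diloco_train(model, replica_loaders, outer_rounds, inner_steps,
                 inner_lr=1e-4, outer_lr=0.7, outer_momentum=0.9):
    """Illustrative DiLoCo-style outer/inner loop (not the paper's code).

    `replica_loaders` is a list of M iterators, each yielding (inputs, targets)
    batches for one replica. Each outer round, every replica takes
    `inner_steps` local AdamW steps starting from the shared global parameters;
    the outer update applies Nesterov-momentum SGD to the averaged delta.
    """
    global_params = [p.detach().clone().requires_grad_(True)
                     for p in model.parameters()]
    outer_opt = torch.optim.SGD(global_params, lr=outer_lr,
                                momentum=outer_momentum, nesterov=True)

    for _ in range(outer_rounds):
        avg_delta = [torch.zeros_like(p) for p in global_params]
        for loader in replica_loaders:              # M independent replicas
            replica = copy.deepcopy(model)
            with torch.no_grad():                   # start from global params
                for p, g in zip(replica.parameters(), global_params):
                    p.copy_(g)
            inner_opt = torch.optim.AdamW(replica.parameters(), lr=inner_lr)
            for inputs, targets in itertools.islice(loader, inner_steps):
                # Toy classification-style loss; a real LM would compute a
                # next-token cross-entropy over sequences.
                loss = F.cross_entropy(replica(inputs), targets)
                inner_opt.zero_grad()
                loss.backward()
                inner_opt.step()
            with torch.no_grad():                   # accumulate outer "gradient"
                for d, g, p in zip(avg_delta, global_params,
                                   replica.parameters()):
                    d += (g - p) / len(replica_loaders)
        # Synchronization happens only here, once per outer round: the averaged
        # delta is treated as the gradient of the global parameters.
        for g, d in zip(global_params, avg_delta):
            g.grad = d
        outer_opt.step()
    return global_params
```

This structure is what "relaxes synchronization demands" in the abstract refers to: replicas exchange parameters only every `inner_steps` steps instead of communicating gradients at every step, as fully synchronous data parallelism does.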
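
The abstract also claims that DiLoCo's evaluation loss can be accurately predicted via scaling laws. As a hedged illustration of what such a fit looks like in practice, the snippet below fits a simple power law L(N) ≈ A·N^(−α) to placeholder (model size, loss) points in log-log space; the functional form and every number are assumptions for illustration, not the paper's parametric fit or data.

```python
import numpy as np

# Hypothetical (model size, evaluation loss) points -- placeholders only,
# not data from the paper.
sizes = np.array([35e6, 100e6, 300e6, 1e9, 2.4e9])     # parameters N
losses = np.array([3.95, 3.52, 3.16, 2.85, 2.68])      # eval loss L(N)

# Fit L(N) ~= A * N**(-alpha) by linear regression in log-log space.
slope, intercept = np.polyfit(np.log(sizes), np.log(losses), 1)
alpha, A = -slope, float(np.exp(intercept))
print(f"fitted law: L(N) ~= {A:.2f} * N^(-{alpha:.4f})")

# Extrapolate the fitted law to a larger model -- the sense in which
# "predictable scaling" is useful when planning larger training runs.
print("predicted loss at 10B params:", A * (10e9) ** (-alpha))
```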
