ZClip: Adaptive Spike Mitigation for LLM Pre-Training
April 3, 2025
Authors: Abhay Kumar, Louis Owen, Nilabhra Roy Chowdhury, Fabian Güra
cs.AI
Abstract
Training large language models (LLMs) presents numerous challenges, including
gradient instability and loss spikes. These phenomena can lead to catastrophic
divergence, requiring costly checkpoint restoration and data batch skipping.
Traditional gradient clipping techniques, such as constant or norm-based
methods, fail to address these issues effectively due to their reliance on
fixed thresholds or heuristics, leading to inefficient learning and requiring
frequent manual intervention. In this work, we propose ZClip, an adaptive
gradient clipping algorithm that dynamically adjusts the clipping threshold
based on statistical properties of gradient norms over time. Unlike prior
reactive strategies, ZClip proactively adapts to training dynamics without
making any prior assumptions on the scale and the temporal evolution of
gradient norms. At its core, it leverages z-score-based anomaly detection to
identify and mitigate large gradient spikes, preventing malignant loss spikes
while not interfering with convergence otherwise. Our code is available at:
https://github.com/bluorion-com/ZClip.
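
For illustration, the following is a minimal sketch of the idea described in the abstract: z-score-based anomaly detection on running statistics of the gradient norm, with gradients rescaled whenever a spike is detected. It is not the authors' implementation (see the repository above for that); the class name ZScoreGradClipper, the hyperparameters ema_alpha and z_threshold, and the EMA-based estimates of the norm's mean and variance are illustrative assumptions.

    import torch

    class ZScoreGradClipper:
        """Illustrative z-score-based adaptive gradient clipping (not the official ZClip code).

        Tracks an exponential moving average (EMA) of the total gradient norm and its
        variance; if the current norm's z-score exceeds a threshold, gradients are
        rescaled toward the running statistics before the optimizer step.
        """

        def __init__(self, ema_alpha: float = 0.97, z_threshold: float = 2.5):
            self.ema_alpha = ema_alpha      # smoothing factor for the running statistics (assumed value)
            self.z_threshold = z_threshold  # z-score above which a norm is treated as a spike (assumed value)
            self.mean = None                # EMA of gradient norms
            self.var = None                 # EMA of squared deviations from the mean

        def step(self, model: torch.nn.Module) -> float:
            # Collect gradients and compute the total L2 norm.
            grads = [p.grad for p in model.parameters() if p.grad is not None]
            if not grads:
                return 0.0
            total_norm = torch.norm(torch.stack([g.norm(2) for g in grads]), 2).item()

            if self.mean is None:
                # Initialize the running statistics from the first observed norm.
                self.mean, self.var = total_norm, 0.0
                return total_norm

            std = max(self.var ** 0.5, 1e-6)
            z = (total_norm - self.mean) / std

            if z > self.z_threshold:
                # Spike detected: rescale gradients so the norm matches the
                # threshold implied by the running statistics.
                clip_to = self.mean + self.z_threshold * std
                scale = clip_to / total_norm
                for g in grads:
                    g.mul_(scale)
                total_norm = clip_to

            # Update the EMA statistics with the (possibly clipped) norm.
            self.mean = self.ema_alpha * self.mean + (1 - self.ema_alpha) * total_norm
            self.var = self.ema_alpha * self.var + (1 - self.ema_alpha) * (total_norm - self.mean) ** 2
            return total_norm

In a training loop, clipper.step(model) would be called after loss.backward() and before optimizer.step(), so that spikes are suppressed before the parameter update.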