MARS: Unleashing the Power of Variance Reduction for Training Large Models

Abstract

Training deep neural networks, and more recently large models, demands efficient and scalable optimizers. Adaptive gradient algorithms like Adam, AdamW, and their variants have been central to this task. Despite the development of numerous variance reduction algorithms in the past decade aimed at accelerating stochastic optimization in both convex and nonconvex settings, variance reduction has not found widespread success in training deep neural networks or large language models. Consequently, it has remained a less favored approach in modern AI. In this paper, to unleash the power of variance reduction for efficient training of large models, we propose a unified optimization framework, MARS (Make vAriance Reduction Shine), which reconciles preconditioned gradient methods with variance reduction via a scaled stochastic recursive momentum technique. Within our framework, we introduce three instances of MARS that leverage preconditioned gradient updates based on AdamW, Lion, and Shampoo, respectively. We also draw a connection between our algorithms and existing optimizers. Experimental results on training GPT-2 models indicate that MARS consistently outperforms AdamW by a large margin.
