Eager Updates For Overlapped Communication and Computation in DiLoCo

Abstract

Distributed optimization methods such as DiLoCo have been shown to be effective in training very large models across multiple distributed workers, such as datacenters. These methods split updates into two parts: an inner optimization phase, where the workers independently execute multiple optimization steps on their own local data, and an outer optimization step, where the inner updates are synchronized. While such approaches require orders of magnitude less communication than standard data-parallel training, in settings where the workers are datacenters, even these limited communication requirements can still cause significant slowdowns, because every outer optimization step blocks until synchronization completes. In this paper, we investigate techniques to mitigate this issue by overlapping communication with computation in a manner that allows the outer optimization step to fully overlap with the inner optimization phase. We show that a particular variant, dubbed eager updates, provides competitive performance with standard DiLoCo in settings with low bandwidth between workers.
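As a rough illustration of the two-phase structure the abstract describes, the toy sketch below runs an inner phase of local steps on each worker and then a synchronizing outer step. It is not the paper's implementation: the scalar quadratic objective, the use of plain SGD for both the inner and outer optimizers, and all function names and hyperparameters are assumptions made for illustration. The eager-updates variant itself is not shown; a comment only marks the blocking point that overlapping is meant to hide.

```python
def local_gradient(theta, shard):
    """Gradient of the mean of 0.5 * (theta - x)^2 over one worker's shard."""
    return sum(theta - x for x in shard) / len(shard)


def inner_phase(theta, shard, inner_steps, inner_lr):
    """Inner optimization phase: several local SGD steps, no communication."""
    for _ in range(inner_steps):
        theta -= inner_lr * local_gradient(theta, shard)
    return theta


def train(shards, outer_rounds=20, inner_steps=10, inner_lr=0.1, outer_lr=1.0):
    """Toy two-phase loop; each entry of `shards` plays the role of one worker."""
    theta = 0.0  # shared parameter (a single scalar, for illustration only)
    for _ in range(outer_rounds):
        # Every worker starts from the same shared parameter and trains locally.
        local_thetas = [
            inner_phase(theta, shard, inner_steps, inner_lr) for shard in shards
        ]
        # Outer optimization step: average the workers' parameter changes.
        # In a real deployment this average is a cross-worker all-reduce, and
        # it is the point where workers would block waiting on communication.
        outer_grad = sum(theta - t for t in local_thetas) / len(local_thetas)
        theta -= outer_lr * outer_grad
    return theta


if __name__ == "__main__":
    # Two hypothetical workers with different local data.
    shards = [[1.0, 2.0, 3.0], [5.0, 6.0, 7.0]]
    print(train(shards))
```

In this toy setting the shared parameter approaches the mean of all the data (4.0 for the two equally sized shards above), even though workers communicate only once per outer round rather than at every step.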
