LASP-2: Rethinking Sequence Parallelism for Linear Attention and Its Hybrid

Abstract

Linear sequence modeling approaches, such as linear attention, offer linear-time training and constant-memory inference with respect to sequence length. However, existing sequence parallelism (SP) methods are either not optimized for the right-product-first feature of linear attention or use a ring-style communication strategy, which lowers computation parallelism and limits their scalability to longer sequences in distributed systems. In this paper, we introduce LASP-2, a new SP method that enhances both communication and computation parallelism when training linear attention transformer models on very long input sequences. Compared to the previous work LASP, LASP-2 rethinks the minimal communication requirement for SP on linear attention layers and reorganizes the whole communication-computation workflow of LASP. As a result, only a single AllGather collective communication is needed on intermediate memory states, whose sizes are independent of the sequence length, leading to significant improvements in both communication and computation parallelism, as well as their overlap. Additionally, we extend LASP-2 to LASP-2H by applying a similar communication redesign to standard attention modules, offering an efficient SP solution for hybrid models that blend linear and standard attention layers. Our evaluation on a Linear-Llama3 model, a variant of Llama3 with linear attention replacing standard attention, demonstrates the effectiveness of LASP-2 and LASP-2H. Specifically, LASP-2 achieves training speed improvements of 15.2% over LASP and 36.6% over Ring Attention at a sequence length of 2048K across 64 GPUs. The code is released as part of: https://github.com/OpenSparseLLMs/Linear-MoE.
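To make the "right-product-first" and "memory states independent of sequence length" points concrete, the following is a minimal single-process sketch, not the released implementation. It assumes unnormalized causal linear attention with no feature map or decay, treats each sequence chunk as one device, and replaces the AllGather with a plain Python list; all function names are hypothetical. It checks that the chunked right-product computation matches the quadratic left-product reference.

```python
import random

def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def transpose(a):
    return [list(col) for col in zip(*a)]

def add(a, b):
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]

def causal_mask(scores):
    """Zero entries above the diagonal so position i only sees j <= i."""
    return [[s if j <= i else 0.0 for j, s in enumerate(row)]
            for i, row in enumerate(scores)]

def reference_attention(q, k, v):
    """Left-product form mask(Q K^T) V: quadratic in sequence length."""
    return matmul(causal_mask(matmul(q, transpose(k))), v)

def chunked_attention(q, k, v, num_chunks):
    """Right-product form over chunks; each chunk stands in for one device."""
    n, d, d_v = len(q), len(q[0]), len(v[0])
    if n % num_chunks != 0:
        raise ValueError("sequence length must be divisible by num_chunks")
    size = n // num_chunks
    chunks = [(q[i:i + size], k[i:i + size], v[i:i + size])
              for i in range(0, n, size)]

    # Each "device" computes its local memory state K_t^T V_t.
    # Its shape is d x d_v, whatever the chunk length is.
    local_states = [matmul(transpose(kt), vt) for _, kt, vt in chunks]

    # Stand-in for the single AllGather (assumption: a real run would use a
    # collective so that every device ends up holding all states).
    gathered = list(local_states)

    out = []
    for t, (qt, kt, vt) in enumerate(chunks):
        # Accumulate the memory states of all earlier chunks.
        prefix = [[0.0] * d_v for _ in range(d)]
        for state in gathered[:t]:
            prefix = add(prefix, state)
        inter = matmul(qt, prefix)  # contribution of earlier chunks
        intra = matmul(causal_mask(matmul(qt, transpose(kt))), vt)  # within chunk
        out.extend(add(inter, intra))
    return out

random.seed(0)
n, d = 8, 4
q = [[random.gauss(0, 1) for _ in range(d)] for _ in range(n)]
k = [[random.gauss(0, 1) for _ in range(d)] for _ in range(n)]
v = [[random.gauss(0, 1) for _ in range(d)] for _ in range(n)]

ref = reference_attention(q, k, v)
out = chunked_attention(q, k, v, num_chunks=4)
max_err = max(abs(a - b) for ra, rb in zip(ref, out) for a, b in zip(ra, rb))
print("max abs difference:", max_err)
```

In this sketch the only data that crosses chunk boundaries is the list of `d x d_v` states, so the amount exchanged grows with the number of chunks and the head dimension rather than with the number of tokens per chunk.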

