LASP-2: Rethinking Sequence Parallelism for Linear Attention and Its Hybrid
February 11, 2025
Authors: Weigao Sun, Disen Lan, Yiran Zhong, Xiaoye Qu, Yu Cheng
cs.AI
Abstract
Linear sequence modeling approaches, such as linear attention, offer
linear-time training and constant-memory inference with respect to
sequence length. However, existing sequence parallelism (SP) methods are
either not optimized for the right-product-first feature of linear attention or
use a ring-style communication strategy, which results in lower computation
parallelism and limits their scalability to longer sequences in distributed
systems. In this paper, we introduce LASP-2, a new SP method to enhance both
communication and computation parallelism when training linear attention
transformer models with very long input sequences. Compared to the previous work
LASP, LASP-2 rethinks the minimal communication requirement for SP on linear
attention layers and reorganizes the whole communication-computation workflow of
LASP. In this way, only a single AllGather collective communication is needed
on intermediate memory states, whose sizes are independent of the sequence
length, leading to significant improvements in both communication and
computation parallelism, as well as their overlap. Additionally, we extend
LASP-2 to LASP-2H by applying similar communication redesign to standard
attention modules, offering an efficient SP solution for hybrid models that
blend linear and standard attention layers. Our evaluation on a Linear-Llama3
model, a variant of Llama3 with linear attention replacing standard attention,
demonstrates the effectiveness of LASP-2 and LASP-2H. Specifically, LASP-2
achieves training speed improvements of 15.2% over LASP and 36.6% over Ring
Attention, with a sequence length of 2048K across 64 GPUs. The code is released
as part of: https://github.com/OpenSparseLLMs/Linear-MoE.
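
The idea of gathering only sequence-length-independent memory states can be illustrated with a small single-process simulation. This is a minimal sketch under stated assumptions, not the authors' implementation: it uses plain causal linear attention with no feature map, decay, or normalization; the function names are hypothetical; and stacking a Python list stands in for the AllGather collective. Each simulated device computes a local state K_t^T V_t of shape d x d, all states are shared once, and every device then combines the sum of the states from earlier chunks with its own masked intra-chunk term.

```python
import numpy as np

def causal_linear_attention(q, k, v):
    """Reference (left-product) form: O = tril(Q K^T) V, without softmax."""
    n = q.shape[0]
    mask = np.tril(np.ones((n, n)))
    return ((q @ k.T) * mask) @ v

def chunked_linear_attention(q, k, v, world_size):
    """Simulate sequence parallelism over `world_size` chunks (hypothetical helper).

    Assumes the sequence length is divisible by `world_size`.
    """
    q_chunks = np.split(q, world_size)
    k_chunks = np.split(k, world_size)
    v_chunks = np.split(v, world_size)

    # Each simulated device computes its local memory state K_t^T V_t.
    # Its shape is (d_k, d_v), independent of the chunk or sequence length.
    local_states = [kc.T @ vc for kc, vc in zip(k_chunks, v_chunks)]

    # Stand-in for a single AllGather: every device sees all local states.
    gathered = np.stack(local_states)  # shape (world_size, d_k, d_v)

    outputs = []
    for rank in range(world_size):
        qc, kc, vc = q_chunks[rank], k_chunks[rank], v_chunks[rank]
        c = qc.shape[0]
        # Sum of states from all earlier chunks (zeros for rank 0).
        prefix_state = gathered[:rank].sum(axis=0)
        # Inter-chunk term uses the right product: Q_t (sum of K_s^T V_s).
        inter = qc @ prefix_state
        # Intra-chunk term needs the causal mask, so it uses the left product.
        intra = ((qc @ kc.T) * np.tril(np.ones((c, c)))) @ vc
        outputs.append(inter + intra)
    return np.concatenate(outputs)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    q, k, v = (rng.standard_normal((16, 4)) for _ in range(3))
    out = chunked_linear_attention(q, k, v, world_size=4)
    print(np.allclose(out, causal_linear_attention(q, k, v)))
```

In a real distributed setting the list stacking would be replaced by a collective call, and the intra-chunk computation could run while that communication is in flight, which is the kind of overlap the abstract refers to.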