Star Attention: Efficient LLM Inference over Long Sequences

Abstract

Inference with Transformer-based Large Language Models (LLMs) on long sequences is both costly and slow due to the quadratic complexity of the self-attention mechanism. We introduce Star Attention, a two-phase block-sparse approximation that improves computational efficiency by sharding attention across multiple hosts while minimizing communication overhead. In the first phase, the context is processed using blockwise-local attention across hosts, in parallel. In the second phase, query and response tokens attend to all prior cached tokens through sequence-global attention. Star Attention integrates seamlessly with most Transformer-based LLMs trained with global attention, reducing memory requirements and inference time by up to 11x while preserving 95-100% of accuracy.
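The two phases described above can be illustrated with a small pure-Python sketch. It is a toy under stated assumptions, not the authors' implementation: there is a single layer with identity projections (each token vector serves as its own query, key, and value), the function names `attend`, `phase1_local`, and `phase2_global` are hypothetical, and the per-host results in phase 2 are merged with a log-sum-exp reweighting, which is one standard way to combine partial softmax outputs and is assumed here rather than taken from the abstract.

```python
import math
import random


def attend(q, keys, values):
    """Softmax attention for one query; returns (output, log-sum-exp of the scores)."""
    scale = 1.0 / math.sqrt(len(q))
    scores = [scale * sum(a * b for a, b in zip(q, k)) for k in keys]
    top = max(scores)  # subtract the max for numerical stability
    weights = [math.exp(s - top) for s in scores]
    total = sum(weights)
    out = [sum(w * v[j] for w, v in zip(weights, values)) / total
           for j in range(len(values[0]))]
    return out, top + math.log(total)


def phase1_local(block):
    """Phase 1 on one host: causal attention restricted to the host's own block."""
    cache = []
    for i, x in enumerate(block):
        ctx, _ = attend(x, block[: i + 1], block[: i + 1])
        cache.append([a + b for a, b in zip(x, ctx)])  # toy residual update
    return cache  # stands in for this host's cached keys/values (assumption)


def phase2_global(q, caches):
    """Phase 2: combine per-host partial results into attention over all cached tokens."""
    partials = [attend(q, c, c) for c in caches]  # each host works independently
    top = max(lse for _, lse in partials)
    weights = [math.exp(lse - top) for _, lse in partials]
    total = sum(weights)
    return [sum(w * out[j] for w, (out, _) in zip(weights, partials)) / total
            for j in range(len(q))]


random.seed(0)
DIM, BLOCK, HOSTS = 4, 3, 2  # toy sizes, chosen only for illustration
context = [[random.gauss(0, 1) for _ in range(DIM)] for _ in range(BLOCK * HOSTS)]
blocks = [context[h * BLOCK:(h + 1) * BLOCK] for h in range(HOSTS)]
caches = [phase1_local(b) for b in blocks]  # no cross-host communication here
query = [random.gauss(0, 1) for _ in range(DIM)]
merged = phase2_global(query, caches)
```

In this sketch each host sends back only one output vector and one scalar per query, which hints at why communication can stay small, while the merged result matches what a single softmax over all cached tokens would give. The approximation relative to full global attention comes from phase 1, where context tokens never see tokens held by other hosts.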
