When Precision Meets Position: BFloat16 Breaks Down RoPE in Long-Context Training

Abstract

Extending context window sizes allows large language models (LLMs) to process longer sequences and handle more complex tasks. Rotary Positional Embedding (RoPE) has become the de facto standard because its relative positional encoding properties benefit long-context training. However, we observe that using RoPE with the BFloat16 format causes numerical issues that make it deviate from its intended relative positional encoding, especially in long-context scenarios. This issue arises from BFloat16's limited precision and accumulates as context length increases, with the first token contributing significantly to the problem. To address this, we develop AnchorAttention, a plug-and-play attention method that alleviates the numerical issues caused by BFloat16, improves long-context capabilities, and speeds up training. AnchorAttention reduces unnecessary attention computations, maintains semantic coherence, and boosts computational efficiency by treating the first token as a shared anchor with a consistent position ID, making it visible to all documents within the training context. Experiments on three types of LLMs demonstrate that AnchorAttention significantly improves long-context performance and reduces training time by over 50% compared to standard full attention mechanisms, while preserving the original LLM's capabilities on general tasks. Our code is available at https://github.com/haonan3/AnchorContext.
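The visibility rule described in the abstract can be illustrated with a small attention mask. The following is a minimal sketch, not the AnchorContext implementation: it assumes several documents are packed into one training context, that attention is causal, that tokens attend only within their own document, and that token 0 of the packed context is the shared anchor visible to every document. The function name `anchor_attention_mask` and the list-of-lengths input are hypothetical, and the sketch does not model position IDs.

```python
from typing import List


def anchor_attention_mask(doc_lengths: List[int]) -> List[List[bool]]:
    """Build a boolean mask for documents packed into one training context.

    mask[q][k] is True when query token q may attend to key token k.
    Assumption (illustrative only): token 0 of the packed sequence is the
    shared anchor that every document is allowed to see.
    """
    total = sum(doc_lengths)
    # Label each token with the index of the document it belongs to.
    doc_id = [d for d, n in enumerate(doc_lengths) for _ in range(n)]
    mask = [[False] * total for _ in range(total)]
    for q in range(total):
        for k in range(q + 1):  # causal: only keys at or before the query
            same_doc = doc_id[q] == doc_id[k]
            is_anchor = k == 0
            mask[q][k] = same_doc or is_anchor
    return mask


if __name__ == "__main__":
    # Two packed documents of lengths 3 and 2.
    for row in anchor_attention_mask([3, 2]):
        print("".join("x" if visible else "." for visible in row))
```

For lengths `[3, 2]`, the second document's tokens see the anchor and their own document but not tokens 1 and 2 of the first document, so 11 of the 15 causal query-key pairs remain. Dropping cross-document pairs in this way is one reading of how the method "reduces unnecessary attention computations".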
