When Precision Meets Position: BFloat16 Breaks Down RoPE in Long-Context Training
November 20, 2024
Authors: Haonan Wang, Qian Liu, Chao Du, Tongyao Zhu, Cunxiao Du, Kenji Kawaguchi, Tianyu Pang
cs.AI
Abstract
Extending context window sizes allows large language models (LLMs) to process
longer sequences and handle more complex tasks. Rotary Positional Embedding
(RoPE) has become the de facto standard due to its relative positional encoding
properties that benefit long-context training. However, we observe that using
RoPE with BFloat16 format results in numerical issues, causing it to deviate
from its intended relative positional encoding, especially in long-context
scenarios. This issue arises from BFloat16's limited precision and accumulates
as context length increases, with the first token contributing significantly to
this problem. To address this, we develop AnchorAttention, a plug-and-play
attention method that alleviates numerical issues caused by BFloat16, improves
long-context capabilities, and speeds up training. AnchorAttention reduces
unnecessary attention computations, maintains semantic coherence, and boosts
computational efficiency by treating the first token as a shared anchor with a
consistent position ID, making it visible to all documents within the training
context. Experiments on three types of LLMs demonstrate that AnchorAttention
significantly improves long-context performance and reduces training time by
over 50% compared to standard full attention mechanisms, while preserving the
original LLM's capabilities on general tasks. Our code is available at
https://github.com/haonan3/AnchorContext.
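
The visibility pattern described in the abstract can be sketched as a boolean attention mask. This is a minimal illustration and not the authors' implementation (see the repository above for that). It assumes a packed training context laid out as one anchor token at index 0 followed by the documents back to back, with causal attention inside each document. The function name `anchor_attention_mask` and the `doc_lengths` parameter are hypothetical, and position IDs are not modeled here.

```python
from typing import List


def anchor_attention_mask(doc_lengths: List[int]) -> List[List[bool]]:
    """Return mask[i][j], True when query position i may attend to key position j.

    Assumed layout (an illustration, not taken from the paper's code):
    index 0 holds the shared anchor token, then each document follows in order.
    """
    # Label every position with its document index; the anchor gets -1.
    doc_ids = [-1]
    for doc_index, length in enumerate(doc_lengths):
        doc_ids.extend([doc_index] * length)

    total = len(doc_ids)
    mask = [[False] * total for _ in range(total)]
    for i in range(total):
        for j in range(i + 1):  # causal: keys at or before the query only
            same_document = doc_ids[i] == doc_ids[j]
            is_anchor = j == 0  # the anchor is visible to every document
            mask[i][j] = same_document or is_anchor
    return mask


# Two packed documents of lengths 2 and 3, preceded by the anchor token.
mask = anchor_attention_mask([2, 3])
for row in mask:
    print("".join("x" if allowed else "." for allowed in row))
# x.....
# xx....
# xxx...
# x..x..
# x..xx.
# x..xxx

anchored_pairs = sum(sum(row) for row in mask)
full_causal_pairs = len(mask) * (len(mask) + 1) // 2
print(anchored_pairs, full_causal_pairs)  # 15 21
```

In this toy case the mask keeps 15 of the 21 query-key pairs that full causal attention would compute. Tokens in the second document cannot see the first document, yet every token still sees the anchor at index 0.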