When Precision Meets Position: BFloat16 Breaks Down RoPE in Long-Context Training
November 20, 2024
Authors: Haonan Wang, Qian Liu, Chao Du, Tongyao Zhu, Cunxiao Du, Kenji Kawaguchi, Tianyu Pang
cs.AI
Abstract
Extending context window sizes allows large language models (LLMs) to process
longer sequences and handle more complex tasks. Rotary Positional Embedding
(RoPE) has become the de facto standard due to its relative positional encoding
properties that benefit long-context training. However, we observe that using
RoPE with BFloat16 format results in numerical issues, causing it to deviate
from its intended relative positional encoding, especially in long-context
scenarios. This issue arises from BFloat16's limited precision and accumulates
as context length increases, with the first token contributing significantly to
this problem. To address this, we develop AnchorAttention, a plug-and-play
attention method that alleviates numerical issues caused by BFloat16, improves
long-context capabilities, and speeds up training. AnchorAttention reduces
unnecessary attention computations, maintains semantic coherence, and boosts
computational efficiency by treating the first token as a shared anchor with a
consistent position ID, making it visible to all documents within the training
context. Experiments on three types of LLMs demonstrate that AnchorAttention
significantly improves long-context performance and reduces training time by
over 50% compared to standard full attention mechanisms, while preserving the
original LLM's capabilities on general tasks. Our code is available at
https://github.com/haonan3/AnchorContext.
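The abstract's central claim, that limited BFloat16 precision makes RoPE deviate from a purely relative encoding, can be illustrated with a toy sketch. This is not the paper's experiment or code. It assumes a pure-Python emulation of BFloat16 rounding (`to_bf16`), a hypothetical 8-dimensional head, and rounding applied only to the rotated query and key vectors. In exact arithmetic the score between positions `m` and `n` depends only on `m - n`, so shifting both positions by the same amount should leave it unchanged.

```python
import math
import struct


def to_bf16(x: float) -> float:
    """Emulate BFloat16: keep the top 16 bits of the float32 encoding,
    rounding to nearest with ties to even. Intended for small finite values."""
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    bias = 0x7FFF + ((bits >> 16) & 1)
    bits = (bits + bias) & 0xFFFF0000
    return struct.unpack("<f", struct.pack("<I", bits))[0]


def rope(vec, pos, base=10000.0, round_fn=None):
    """Rotate each consecutive pair of `vec` by the angle pos * theta_i,
    with theta_i = base ** (-i / d) for even element index i."""
    d = len(vec)
    out = []
    for i in range(0, d, 2):
        theta = base ** (-i / d)
        c, s = math.cos(pos * theta), math.sin(pos * theta)
        x, y = vec[i], vec[i + 1]
        out += [x * c - y * s, x * s + y * c]
    if round_fn is not None:
        # Assumption of this sketch: only the rotated vector is rounded.
        out = [round_fn(v) for v in out]
    return out


def score(q, k, m, n, round_fn=None):
    """Dot product of the query rotated to position m and the key at n."""
    qm = rope(q, m, round_fn=round_fn)
    kn = rope(k, n, round_fn=round_fn)
    return sum(a * b for a, b in zip(qm, kn))


def max_deviation(q, k, gap, shifts, round_fn=None):
    """Largest change in the score when both positions are shifted by the
    same amount; this is zero for an ideal relative encoding."""
    ref = score(q, k, gap, 0, round_fn)
    return max(abs(score(q, k, gap + s, s, round_fn) - ref) for s in shifts)


if __name__ == "__main__":
    q = [1.0, 0.5, -0.3, 0.8, 0.2, -0.7, 0.9, 0.1]
    k = [0.4, -0.6, 0.7, 0.3, -0.5, 0.2, 0.1, 0.9]
    shifts = range(0, 8192, 128)
    print("float64 deviation:", max_deviation(q, k, 16, shifts))
    print("bf16    deviation:", max_deviation(q, k, 16, shifts, to_bf16))
```

Comparing the two printed values shows how much the shift invariance degrades once the rotated vectors are rounded. The sketch does not model the accumulation over context length or the role of the first token that the abstract describes.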