When Precision Meets Position: BFloat16 Breaks Down RoPE in Long-Context Training

Abstract

Extending context window sizes allows large language models (LLMs) to process longer sequences and handle more complex tasks. Rotary Positional Embedding (RoPE) has become the de facto standard because its relative positional encoding properties benefit long-context training. However, we observe that using RoPE with the BFloat16 format causes numerical issues that make it deviate from its intended relative positional encoding, especially in long-context scenarios. The issue arises from BFloat16's limited precision and accumulates as context length increases, with the first token contributing significantly. To address it, we develop AnchorAttention, a plug-and-play attention method that alleviates the numerical issues caused by BFloat16, improves long-context capabilities, and speeds up training. AnchorAttention treats the first token as a shared anchor with a consistent position ID, making it visible to all documents within the training context; this reduces unnecessary attention computations, maintains semantic coherence, and boosts computational efficiency. Experiments on three types of LLMs demonstrate that, compared to standard full attention mechanisms, AnchorAttention significantly improves long-context performance and reduces training time by over 50%, while preserving the original LLM's capabilities on general tasks. Our code is available at https://github.com/haonan3/AnchorContext.
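
The shared-anchor idea described in the abstract can be illustrated with a small attention-mask sketch. This is not the authors' implementation (see the linked repository for that); it is a minimal illustration under stated assumptions: several documents are packed into one training context, position 0 holds the single anchor token, attention is causal, and each document otherwise attends only within itself. The function name `anchor_attention_mask` and the `doc_lengths` parameter are hypothetical.

```python
from typing import List


def anchor_attention_mask(doc_lengths: List[int]) -> List[List[bool]]:
    """Return mask[q][k] == True where query position q may attend to key k.

    Assumed layout (illustrative only): index 0 is the shared anchor token,
    followed by the documents packed back to back with the given lengths.
    """
    total = 1 + sum(doc_lengths)

    # doc_id[i] is the document index of position i; -1 marks the anchor.
    doc_id = [-1]
    for idx, length in enumerate(doc_lengths):
        doc_id.extend([idx] * length)

    mask = [[False] * total for _ in range(total)]
    for q in range(total):
        for k in range(q + 1):  # causal: only keys at or before the query
            same_doc = doc_id[q] == doc_id[k]
            is_anchor = k == 0  # the anchor is visible to every document
            mask[q][k] = same_doc or is_anchor
    return mask


if __name__ == "__main__":
    # Two packed documents of lengths 2 and 3, plus the anchor at index 0.
    for row in anchor_attention_mask([2, 3]):
        print("".join("x" if allowed else "." for allowed in row))
```

In this sketch, cross-document query-key pairs are masked out, so fewer pairs are computed than under full causal attention, while every document still sees the same first token at the same position.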
