Mask-Enhanced Autoregressive Prediction: Pay Less Attention to Learn More

February 11, 2025
作者: Xialie Zhuang, Zhikai Jia, Jianjin Li, Zhenyu Zhang, Li Shen, Zheng Cao, Shiwei Liu
cs.AI

Abstract

Large Language Models (LLMs) have been found to struggle with accurately retrieving key information. To address this, we propose Mask-Enhanced Autoregressive Prediction (MEAP), a simple yet effective training paradigm that seamlessly integrates Masked Language Modeling (MLM) into Next-Token Prediction (NTP) to enhance the latter's in-context retrieval capabilities. Specifically, MEAP first randomly masks a small fraction of input tokens and then directly performs standard autoregressive next-token prediction using a decoder-only Transformer. MEAP eliminates the need for bidirectional attention or encoder-decoder architectures for MLM, incurring no additional computational overhead during pre-training or inference. Extensive experiments demonstrate that MEAP substantially outperforms NTP on key-information retrieval and long-context reasoning tasks, while performing on par with or better than it on commonsense reasoning tasks. The benefits of MEAP also extend to supervised fine-tuning, where it shows remarkable advantages in lost-in-the-middle scenarios, outperforming NTP by 11.77 percentage points. Our analysis indicates that MEAP's effectiveness arises from its ability to promote more distinguishable attention scores by concentrating attention on a reduced set of non-masked tokens. This mechanism improves the model's focus on task-relevant signals while mitigating the influence of peripheral context. These findings position MEAP as a promising training paradigm for large language models.
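The mask-then-predict recipe described above is simple enough to sketch. Below is a minimal, hypothetical PyTorch illustration of one way to prepare a MEAP-style training batch; the function name, the uniform sampling of mask positions, and the 0.15 ratio are illustrative assumptions, not the paper's exact implementation:

```python
import torch

def meap_batch(input_ids: torch.LongTensor, mask_token_id: int, mask_ratio: float = 0.15):
    """Prepare one MEAP-style batch (a sketch, not the authors' code).

    A small fraction of input tokens is replaced with a mask token, and the
    model is then trained with the ordinary next-token objective: inputs are
    the (partially masked) sequence, labels are the original next tokens.
    """
    # Copy so the original ids remain available as prediction targets.
    masked = input_ids.clone()

    # Sample mask positions uniformly at random (assumed schedule).
    mask = torch.rand(input_ids.shape) < mask_ratio
    masked[mask] = mask_token_id

    # Standard next-token shift: the decoder-only model sees the masked
    # context under plain causal attention and predicts the original tokens.
    inputs = masked[:, :-1]
    labels = input_ids[:, 1:]
    return inputs, labels
```

The resulting (inputs, labels) pair feeds the usual cross-entropy next-token loss, which is consistent with the abstract's claim that MEAP needs no bidirectional attention, no encoder-decoder architecture, and no extra compute at pre-training or inference time.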
