Mask-Enhanced Autoregressive Prediction: Pay Less Attention to Learn More
February 11, 2025
Authors: Xialie Zhuang, Zhikai Jia, Jianjin Li, Zhenyu Zhang, Li Shen, Zheng Cao, Shiwei Liu
cs.AI
Abstract
Large Language Models (LLMs) have been found to struggle with accurately
retrieving key information. To address this, we propose Mask-Enhanced
Autoregressive Prediction (MEAP), a simple yet effective training paradigm that
seamlessly integrates Masked Language Modeling (MLM) into Next-Token Prediction
(NTP) to enhance the latter's in-context retrieval capabilities. Specifically,
MEAP first randomly masks a small fraction of input tokens and then directly
performs standard autoregressive next-token prediction using a decoder-only
Transformer. MEAP eliminates the need for bidirectional attention or
encoder-decoder architectures for MLM, incurring no additional computational
overhead during pre-training or inference. Extensive experiments demonstrate
that MEAP substantially outperforms NTP on key information retrieval and
long-context reasoning tasks, while performing on par with or better than NTP
on commonsense reasoning tasks. The benefits of MEAP also extend to supervised fine-tuning,
where it shows remarkable advantages in lost-in-the-middle scenarios,
outperforming NTP by 11.77 percentage points. Our analysis indicates that
MEAP's effectiveness arises from its ability to promote more distinguishable
attention scores by concentrating on a reduced set of non-masked tokens. This
mechanism improves the model's focus on task-relevant signals while mitigating
the influence of peripheral context. These findings position MEAP as a
promising training paradigm for large language models.
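The abstract describes the training recipe only at a high level, so a minimal sketch of one MEAP training step may help make it concrete. The code below assumes a Hugging Face-style causal language model whose forward pass returns `.logits`; the helper name `meap_step`, the `mask_token_id` argument, and the 15% mask ratio are illustrative assumptions rather than the paper's exact settings.

```python
# Minimal sketch of one MEAP training step (PyTorch).
# Assumptions not taken from the paper: a Hugging Face-style causal LM
# whose forward pass returns .logits, a 15% mask ratio, and an existing
# [MASK] token id in the vocabulary.
import torch
import torch.nn.functional as F

def meap_step(model, input_ids, mask_token_id, mask_ratio=0.15):
    """Randomly mask a small fraction of input tokens, then train with
    standard next-token prediction on the corrupted sequence."""
    labels = input_ids.clone()                     # targets keep the original tokens
    corrupted = input_ids.clone()
    noise = torch.rand_like(input_ids, dtype=torch.float)
    corrupted[noise < mask_ratio] = mask_token_id  # mask a small fraction of inputs
    logits = model(corrupted).logits               # decoder-only, causal attention
    # Shift so that position t predicts the original token at t+1; no
    # bidirectional attention or encoder-decoder machinery is needed.
    loss = F.cross_entropy(
        logits[:, :-1].reshape(-1, logits.size(-1)),
        labels[:, 1:].reshape(-1),
    )
    return loss
```

Because only the inputs are corrupted while the targets remain the original tokens, the model must recover masked content through ordinary next-token prediction, which is how MEAP folds MLM into NTP without any extra computational overhead at pre-training or inference time.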