Mask-Enhanced Autoregressive Prediction: Pay Less Attention to Learn More

Abstract

Large Language Models (LLMs) have been found to struggle with accurately retrieving key information. To address this, we propose Mask-Enhanced Autoregressive Prediction (MEAP), a simple yet effective training paradigm that integrates Masked Language Modeling (MLM) into Next-Token Prediction (NTP) to enhance the latter's in-context retrieval capabilities. Specifically, MEAP first randomly masks a small fraction of the input tokens and then directly performs standard autoregressive next-token prediction using a decoder-only Transformer. MEAP eliminates the need for the bidirectional attention or encoder-decoder architectures normally used for MLM, and it incurs no additional computational overhead during pre-training or inference. Extensive experiments demonstrate that MEAP substantially outperforms NTP on key information retrieval and long-context reasoning tasks, while performing on par or better on commonsense reasoning tasks. The benefits of MEAP also extend to supervised fine-tuning, where it shows remarkable advantages in lost-in-the-middle scenarios, outperforming NTP by 11.77 percentage points. Our analysis indicates that MEAP's effectiveness arises from its ability to promote more distinguishable attention scores by concentrating on a reduced set of non-masked tokens. This mechanism improves the model's focus on task-relevant signals while mitigating the influence of peripheral context. These findings position MEAP as a promising training paradigm for large language models.
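
The masking step described in the abstract can be illustrated with a small plain-Python sketch. It is not the authors' implementation. The function name `meap_style_pair`, the mask token id `0`, the mask ratio `0.25`, and the toy token ids are all hypothetical, and the sketch assumes that the prediction targets are the original, unmasked tokens shifted by one position, which the abstract does not spell out.

```python
import random


def meap_style_pair(token_ids, mask_id, mask_ratio, seed=0):
    """Build one (inputs, targets) training pair with random input masking.

    Assumption (not stated in the abstract): targets are the ORIGINAL
    tokens shifted by one, so masking only changes what the model sees.
    """
    rng = random.Random(seed)
    n = len(token_ids)
    n_mask = int(n * mask_ratio)  # number of positions to hide
    masked_positions = set(rng.sample(range(n), n_mask))

    # Replace the chosen positions with the (hypothetical) mask token id.
    corrupted = [
        mask_id if i in masked_positions else tok
        for i, tok in enumerate(token_ids)
    ]

    inputs = corrupted[:-1]        # what a causal decoder would read
    targets = list(token_ids[1:])  # ordinary next-token labels
    return inputs, targets, masked_positions


tokens = list(range(100, 120))  # 20 made-up token ids
inputs, targets, masked = meap_style_pair(tokens, mask_id=0, mask_ratio=0.25)
print(len(masked), len(inputs), len(targets))
```

Because only the input side changes, such a pair could be fed to an unmodified causal decoder with its usual cross-entropy loss, which is consistent with the abstract's claim that no bidirectional attention or encoder-decoder architecture is needed.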
