Large Language Diffusion Models

Abstract

Autoregressive models (ARMs) are widely regarded as the cornerstone of large language models (LLMs). We challenge this notion by introducing LLaDA, a diffusion model trained from scratch under the pre-training and supervised fine-tuning (SFT) paradigm. LLaDA models distributions through a forward data masking process and a reverse process, the latter parameterized by a vanilla Transformer that predicts masked tokens. By optimizing a likelihood bound, it provides a principled generative approach for probabilistic inference. Across extensive benchmarks, LLaDA demonstrates strong scalability, outperforming our self-constructed ARM baselines. Remarkably, LLaDA 8B is competitive with strong LLMs like LLaMA3 8B in in-context learning and, after SFT, exhibits impressive instruction-following abilities in case studies such as multi-turn dialogue. Moreover, LLaDA addresses the reversal curse, surpassing GPT-4o in a reversal poem completion task. Our findings establish diffusion models as a viable and promising alternative to ARMs, challenging the assumption that the key LLM capabilities discussed above are inherently tied to ARMs.
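
The forward masking process mentioned in the abstract can be illustrated with a minimal sketch. The abstract does not spell out the details, so everything below is an assumption made for illustration: the `[MASK]` symbol, the helper names `forward_mask` and `masked_positions`, the choice to mask each token independently with probability `t`, and the choice to draw `t` uniformly from [0, 1]. The reverse process (a Transformer that predicts the original tokens at the masked positions) is not implemented here; the sketch only shows which positions such a model would be asked to recover.

```python
import random

MASK = "[MASK]"  # hypothetical mask symbol, not taken from the source


def forward_mask(tokens, t, rng):
    """Replace each token with MASK independently with probability t (assumed scheme)."""
    return [MASK if rng.random() < t else tok for tok in tokens]


def masked_positions(noisy):
    """Return the indices a mask predictor would be trained to recover."""
    return [i for i, tok in enumerate(noisy) if tok == MASK]


rng = random.Random(0)
tokens = "diffusion models can generate text".split()

# Assumed: the masking level t is sampled uniformly per training example.
t = rng.uniform(0.0, 1.0)
noisy = forward_mask(tokens, t, rng)

# Training targets: the original tokens at the masked positions only.
targets = {i: tokens[i] for i in masked_positions(noisy)}
```

Under these assumptions, `t = 0.0` leaves the sequence untouched and `t = 1.0` masks every token, so the two extremes correspond to clean data and a fully masked sequence.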
