Large Language Diffusion Models
February 14, 2025
Authors: Shen Nie, Fengqi Zhu, Zebin You, Xiaolu Zhang, Jingyang Ou, Jun Hu, Jun Zhou, Yankai Lin, Ji-Rong Wen, Chongxuan Li
cs.AI
Abstract
Autoregressive models (ARMs) are widely regarded as the cornerstone of large
language models (LLMs). We challenge this notion by introducing LLaDA, a
diffusion model trained from scratch under the pre-training and supervised
fine-tuning (SFT) paradigm. LLaDA models distributions through a forward data
masking process and a reverse process, parameterized by a vanilla Transformer
to predict masked tokens. By optimizing a likelihood bound, it provides a
principled generative approach for probabilistic inference. Across extensive
benchmarks, LLaDA demonstrates strong scalability, outperforming our
self-constructed ARM baselines. Remarkably, LLaDA 8B is competitive with strong
LLMs like LLaMA3 8B in in-context learning and, after SFT, exhibits impressive
instruction-following abilities in case studies such as multi-turn dialogue.
Moreover, LLaDA addresses the reversal curse, surpassing GPT-4o in a reversal
poem completion task. Our findings establish diffusion models as a viable and
promising alternative to ARMs, challenging the assumption that key LLM
capabilities discussed above are inherently tied to ARMs.
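To make the forward masking process and likelihood bound mentioned above concrete, the following is a minimal sketch of a masked-diffusion training step, assuming a PyTorch-style setup. The names `model`, `mask_id`, and the per-token normalization are illustrative assumptions, not the authors' released code: a masking ratio t is sampled uniformly, each token is masked independently with probability t, and the Transformer is trained to recover the masked tokens with a 1/t-weighted cross-entropy, which bounds the negative log-likelihood of the data.

```python
import torch
import torch.nn.functional as F

def masked_diffusion_loss(model, x0, mask_id, t_eps=1e-3):
    """Illustrative masked-diffusion training objective (a sketch, not the paper's code).

    model: a vanilla Transformer mapping token ids (B, L) to logits (B, L, V).
    x0: clean token ids of shape (B, L); mask_id: id of the [MASK] token (assumed).
    """
    b, l = x0.shape
    # Forward process: sample a masking ratio t ~ U(0, 1) per sequence.
    t = torch.rand(b, 1, device=x0.device).clamp(min=t_eps)
    # Mask each token independently with probability t.
    masked = torch.rand(b, l, device=x0.device) < t
    xt = torch.where(masked, torch.full_like(x0, mask_id), x0)
    # Reverse process: the Transformer predicts the original tokens at every position.
    logits = model(xt)  # (B, L, V)
    ce = F.cross_entropy(logits.transpose(1, 2), x0, reduction="none")  # (B, L)
    # Likelihood-bound weighting: 1/t-weighted loss over masked positions only,
    # averaged per token (normalization choice is an assumption here).
    loss = (ce * masked / t).sum() / (b * l)
    return loss
```

At sampling time the process runs in reverse: starting from a fully masked sequence, the model repeatedly predicts tokens and re-masks a shrinking subset until no masks remain.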