Large Language Diffusion Models

February 14, 2025
Authors: Shen Nie, Fengqi Zhu, Zebin You, Xiaolu Zhang, Jingyang Ou, Jun Hu, Jun Zhou, Yankai Lin, Ji-Rong Wen, Chongxuan Li
cs.AI

Abstract

Autoregressive models (ARMs) are widely regarded as the cornerstone of large language models (LLMs). We challenge this notion by introducing LLaDA, a diffusion model trained from scratch under the pre-training and supervised fine-tuning (SFT) paradigm. LLaDA models distributions through a forward data masking process and a reverse process, parameterized by a vanilla Transformer to predict masked tokens. By optimizing a likelihood bound, it provides a principled generative approach for probabilistic inference. Across extensive benchmarks, LLaDA demonstrates strong scalability, outperforming our self-constructed ARM baselines. Remarkably, LLaDA 8B is competitive with strong LLMs like LLaMA3 8B in in-context learning and, after SFT, exhibits impressive instruction-following abilities in case studies such as multi-turn dialogue. Moreover, LLaDA addresses the reversal curse, surpassing GPT-4o in a reversal poem completion task. Our findings establish diffusion models as a viable and promising alternative to ARMs, challenging the assumption that key LLM capabilities discussed above are inherently tied to ARMs.
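
To make the training recipe described above concrete, the following is a minimal sketch in PyTorch of a masked-diffusion training step: a random fraction t of tokens is masked in the forward process, and a Transformer is trained to recover them in the reverse process under a 1/t-weighted cross-entropy that bounds the likelihood. The names `model` and `mask_id`, and the exact normalization, are illustrative assumptions rather than details taken from the paper.

```python
import torch
import torch.nn.functional as F

def masked_diffusion_loss(model, x0, mask_id):
    """One training step on a batch of clean token ids x0 with shape [B, L]."""
    B, L = x0.shape
    # Forward process: sample a masking ratio t per sequence, then mask each
    # token independently with probability t.
    t = torch.rand(B, 1, device=x0.device).clamp(min=1e-3)   # avoid division by zero
    is_masked = torch.rand(B, L, device=x0.device) < t
    xt = torch.where(is_masked, torch.full_like(x0, mask_id), x0)

    # Reverse process: a vanilla Transformer (an opaque `model` here) predicts
    # the original token at every position from the partially masked input.
    logits = model(xt)                                        # [B, L, vocab_size]
    ce = F.cross_entropy(logits.transpose(1, 2), x0, reduction="none")  # [B, L]

    # Likelihood bound: 1/t-weighted cross-entropy, counted only at masked positions.
    loss = (is_masked.float() / t * ce).sum() / (B * L)
    return loss
```

At sampling time, generation would start from a fully masked sequence and iteratively fill in tokens through the reverse process; the paper's exact sampling and remasking schedule is not reproduced here.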
