Emu3: Next-Token Prediction is All You Need

September 27, 2024
Authors: Xinlong Wang, Xiaosong Zhang, Zhengxiong Luo, Quan Sun, Yufeng Cui, Jinsheng Wang, Fan Zhang, Yueze Wang, Zhen Li, Qiying Yu, Yingli Zhao, Yulong Ao, Xuebin Min, Tao Li, Boya Wu, Bo Zhao, Bowen Zhang, Liangdong Wang, Guang Liu, Zheqi He, Xi Yang, Jingjing Liu, Yonghua Lin, Tiejun Huang, Zhongyuan Wang
cs.AI

Abstract

While next-token prediction is considered a promising path towards artificial general intelligence, it has struggled to excel in multimodal tasks, which are still dominated by diffusion models (e.g., Stable Diffusion) and compositional approaches (e.g., CLIP combined with LLMs). In this paper, we introduce Emu3, a new suite of state-of-the-art multimodal models trained solely with next-token prediction. By tokenizing images, text, and videos into a discrete space, we train a single transformer from scratch on a mixture of multimodal sequences. Emu3 outperforms several well-established task-specific models in both generation and perception tasks, surpassing flagship models such as SDXL and LLaVA-1.6, while eliminating the need for diffusion or compositional architectures. Emu3 is also capable of generating high-fidelity video via predicting the next token in a video sequence. We simplify complex multimodal model designs by converging on a singular focus: tokens, unlocking great potential for scaling both during training and inference. Our results demonstrate that next-token prediction is a promising path towards building general multimodal intelligence beyond language. We open-source key techniques and models to support further research in this direction.
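The abstract's central recipe is to map every modality into one discrete vocabulary and train a single causal transformer with a plain next-token objective. The sketch below is not the Emu3 implementation; it is a minimal illustration of that idea, where the shared vocabulary size, the special boundary tokens (BOS/BOI/EOI), and the tiny model configuration are all hypothetical placeholders chosen only to make the example self-contained and runnable.

```python
# Minimal sketch (assumptions, not the authors' code): text and visual tokens share one
# discrete vocabulary, are interleaved into a single sequence, and a causal transformer
# is trained with a standard next-token cross-entropy loss.
import torch
import torch.nn as nn

VOCAB_SIZE = 1024        # hypothetical shared vocab: text tokens + visual codebook ids
BOS, BOI, EOI = 0, 1, 2  # hypothetical special tokens for sequence/image boundaries

class TinyMultimodalLM(nn.Module):
    def __init__(self, vocab=VOCAB_SIZE, d_model=128, n_layers=2, n_heads=4):
        super().__init__()
        self.embed = nn.Embedding(vocab, d_model)
        layer = nn.TransformerEncoderLayer(d_model, n_heads, dim_feedforward=256,
                                           batch_first=True)
        self.blocks = nn.TransformerEncoder(layer, n_layers)
        self.head = nn.Linear(d_model, vocab)

    def forward(self, tokens):
        # Causal mask: each position may only attend to earlier tokens.
        L = tokens.size(1)
        mask = torch.triu(torch.ones(L, L, device=tokens.device), diagonal=1).bool()
        h = self.blocks(self.embed(tokens), mask=mask)
        return self.head(h)

# Stand-ins for the outputs of a text tokenizer and a visual tokenizer (codebook ids).
text_ids = torch.randint(3, VOCAB_SIZE, (1, 16))
image_ids = torch.randint(3, VOCAB_SIZE, (1, 32))

# One interleaved multimodal sequence: <bos> text <boi> image tokens <eoi>
seq = torch.cat([torch.tensor([[BOS]]), text_ids,
                 torch.tensor([[BOI]]), image_ids,
                 torch.tensor([[EOI]])], dim=1)

model = TinyMultimodalLM()
logits = model(seq[:, :-1])  # predict token t+1 from tokens up to t
loss = nn.functional.cross_entropy(
    logits.reshape(-1, VOCAB_SIZE), seq[:, 1:].reshape(-1))
print(f"next-token loss: {loss.item():.3f}")
```

Because generation and perception reduce to the same objective, the only modality-specific components in this framing are the tokenizers that map images, text, and video into and out of the shared discrete space.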
