Token-Shuffle: Towards High-Resolution Image Generation with Autoregressive Models

April 24, 2025
Authors: Xu Ma, Peize Sun, Haoyu Ma, Hao Tang, Chih-Yao Ma, Jialiang Wang, Kunpeng Li, Xiaoliang Dai, Yujun Shi, Xuan Ju, Yushi Hu, Artsiom Sanakoyeu, Felix Juefei-Xu, Ji Hou, Junjiao Tian, Tao Xu, Tingbo Hou, Yen-Cheng Liu, Zecheng He, Zijian He, Matt Feiszli, Peizhao Zhang, Peter Vajda, Sam Tsai, Yun Fu
cs.AI

Abstract

Autoregressive (AR) models, long dominant in language generation, are increasingly applied to image synthesis but are often considered less competitive than diffusion-based models. A primary limitation is the substantial number of image tokens required by AR models, which constrains both training and inference efficiency as well as image resolution. To address this, we present Token-Shuffle, a novel yet simple method that reduces the number of image tokens in the Transformer. Our key insight is the dimensional redundancy of visual vocabularies in Multimodal Large Language Models (MLLMs), where low-dimensional visual codes from the visual encoder are directly mapped to high-dimensional language vocabularies. Leveraging this, we consider two key operations: token-shuffle, which merges spatially local tokens along the channel dimension to decrease the input token number, and token-unshuffle, which untangles the inferred tokens after the Transformer blocks to restore the spatial arrangement for output. Jointly trained with textual prompts, our strategy requires no additional pretrained text encoder and enables MLLMs to support extremely high-resolution image synthesis in a unified next-token-prediction manner while maintaining efficient training and inference. For the first time, we push the boundary of AR text-to-image generation to a resolution of 2048x2048 with gratifying generation performance. On the GenAI benchmark, our 2.7B model achieves an overall score of 0.77 on hard prompts, outperforming the AR model LlamaGen by 0.18 and the diffusion model LDM by 0.15. Exhaustive large-scale human evaluations also demonstrate our prominent image generation ability in terms of text alignment, visual flaws, and visual appearance. We hope that Token-Shuffle can serve as a foundational design for efficient high-resolution image generation within MLLMs.
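To make the shuffle/unshuffle idea concrete, the following PyTorch sketch shows only the core rearrangement: folding each local s x s window of tokens into the channel dimension before the Transformer blocks and unfolding it afterwards. This is a minimal illustration under assumed conventions, not the authors' implementation; the (batch, sequence, channels) token layout, the window size s, and the names token_shuffle/token_unshuffle are assumptions, and any projection that maps the merged features back to the model width is omitted.

    import torch


    def token_shuffle(tokens: torch.Tensor, h: int, w: int, s: int = 2) -> torch.Tensor:
        """Merge each s x s window of spatially local tokens along the channel axis.

        tokens: (B, h*w, d), flattened row-major over an h x w grid.
        Returns: (B, (h//s)*(w//s), d*s*s), i.e. s*s fewer tokens, each with s*s more channels.
        """
        B, n, d = tokens.shape
        assert n == h * w and h % s == 0 and w % s == 0
        x = tokens.view(B, h // s, s, w // s, s, d)   # split the grid into s x s windows
        x = x.permute(0, 1, 3, 2, 4, 5)               # (B, h/s, w/s, s, s, d)
        return x.reshape(B, (h // s) * (w // s), s * s * d)


    def token_unshuffle(tokens: torch.Tensor, h: int, w: int, s: int = 2) -> torch.Tensor:
        """Inverse of token_shuffle: restore the original h x w spatial arrangement.

        tokens: (B, (h//s)*(w//s), d*s*s). Returns: (B, h*w, d).
        """
        B, m, c = tokens.shape
        d = c // (s * s)
        assert m == (h // s) * (w // s) and c == s * s * d
        x = tokens.view(B, h // s, w // s, s, s, d)   # channel order is (s, s, d), matching the shuffle
        x = x.permute(0, 1, 3, 2, 4, 5)               # (B, h/s, s, w/s, s, d)
        return x.reshape(B, h * w, d)


    # Example: a hypothetical 64 x 64 grid of 256-dim tokens; s=2 cuts the sequence 4x.
    x = torch.randn(2, 64 * 64, 256)
    y = token_shuffle(x, h=64, w=64, s=2)             # -> (2, 1024, 1024)
    assert torch.equal(token_unshuffle(y, h=64, w=64, s=2), x)

Because both operations are pure rearrangements, the round trip is lossless; the sequence-length reduction (here 4x for s=2) is what makes higher resolutions tractable for next-token prediction.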
