Infinity: Scaling Bitwise AutoRegressive Modeling for High-Resolution Image Synthesis
December 5, 2024
Authors: Jian Han, Jinlai Liu, Yi Jiang, Bin Yan, Yuqi Zhang, Zehuan Yuan, Bingyue Peng, Xiaobing Liu
cs.AI
Abstract
We present Infinity, a Bitwise Visual AutoRegressive Modeling approach capable
of generating high-resolution, photorealistic images following language
instructions. Infinity redefines the visual autoregressive model under a bitwise
token prediction framework with an infinite-vocabulary tokenizer and classifier
and a bitwise self-correction mechanism, remarkably improving generation
capacity and detail. By theoretically scaling the tokenizer vocabulary size to
infinity and concurrently scaling the transformer size, our method
significantly unleashes powerful scaling capabilities compared to vanilla VAR.
Infinity sets a new record for autoregressive text-to-image models,
outperforming top-tier diffusion models like SD3-Medium and SDXL. Notably,
Infinity surpasses SD3-Medium by improving the GenEval benchmark score from
0.62 to 0.73 and the ImageReward benchmark score from 0.87 to 0.96, achieving a
win rate of 66%. Without extra optimization, Infinity generates a high-quality
1024x1024 image in 0.8 seconds, making it 2.6x faster than SD3-Medium and
establishing it as the fastest text-to-image model. Models and code will be
released to promote further exploration of Infinity for visual generation and
unified tokenizer modeling.
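
To make the idea of bitwise token prediction more concrete, here is a minimal sketch, not the authors' implementation. It assumes a token is identified by a code of `num_bits` bits, so the vocabulary has `2 ** num_bits` entries, and it compares the weight count of a linear output head that scores every index with one that emits a single logit per bit. The function names, the one-logit-per-bit head, and the example sizes are all illustrative assumptions and are not taken from the paper.

```python
def index_to_bits(index: int, num_bits: int) -> list[int]:
    """Return the binary code of `index` as `num_bits` bits, most significant first."""
    if not 0 <= index < 2 ** num_bits:
        raise ValueError("index out of range for the given number of bits")
    return [(index >> shift) & 1 for shift in range(num_bits - 1, -1, -1)]


def bits_to_index(bits: list[int]) -> int:
    """Inverse of index_to_bits: fold the bits back into an integer index."""
    index = 0
    for bit in bits:
        index = (index << 1) | bit
    return index


def head_parameters(hidden_dim: int, num_bits: int) -> tuple[int, int]:
    """Weight counts of a linear output head, biases ignored.

    index_wise: one logit per vocabulary entry (2 ** num_bits outputs).
    bitwise:    one logit per bit (assumption: a sigmoid-style binary output).
    """
    index_wise = hidden_dim * (2 ** num_bits)
    bitwise = hidden_dim * num_bits
    return index_wise, bitwise


if __name__ == "__main__":
    bits = index_to_bits(5, 4)
    print(bits, bits_to_index(bits))
    # Hypothetical sizes, chosen only to show how the two heads scale.
    print(head_parameters(hidden_dim=2048, num_bits=32))
```

Under these assumptions the index-wise head grows exponentially with the number of bits while the bitwise head grows linearly, which is one way to read the abstract's claim that the vocabulary can be scaled far beyond what a conventional classifier allows.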