Infinity: Scaling Bitwise AutoRegressive Modeling for High-Resolution Image Synthesis
December 5, 2024
Authors: Jian Han, Jinlai Liu, Yi Jiang, Bin Yan, Yuqi Zhang, Zehuan Yuan, Bingyue Peng, Xiaobing Liu
cs.AI
Abstract
We present Infinity, a bitwise visual autoregressive model capable of
generating high-resolution, photorealistic images following language
instructions. Infinity redefines the visual autoregressive model under a bitwise
token prediction framework with an infinite-vocabulary tokenizer and classifier
and a bitwise self-correction mechanism, remarkably improving the generation
capacity and details. By theoretically scaling the tokenizer vocabulary size to
infinity and concurrently scaling the transformer size, our method
unlocks substantially stronger scaling capabilities than vanilla VAR.
Infinity sets a new record for autoregressive text-to-image models,
outperforming top-tier diffusion models like SD3-Medium and SDXL. Notably,
Infinity surpasses SD3-Medium by improving the GenEval benchmark score from
0.62 to 0.73 and the ImageReward benchmark score from 0.87 to 0.96, achieving a
win rate of 66%. Without extra optimization, Infinity generates a high-quality
1024x1024 image in 0.8 seconds, making it 2.6x faster than SD3-Medium and
establishing it as the fastest text-to-image model. Models and code will be
released to promote further exploration of Infinity for visual generation and
unified tokenizer modeling.
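
To give a rough sense of why bitwise token prediction lets the vocabulary grow so large, the sketch below compares the output-layer size of a conventional classifier, which scores every one of the 2^d possible tokens, with a classifier that predicts each of the d bits separately. This is an illustrative sketch only, not the authors' implementation: the function names, the two-logits-per-bit layout, and the example sizes (hidden size 2048, 32 bits per token) are assumptions chosen for the example.

```python
def index_to_bits(index: int, num_bits: int) -> list[int]:
    """Split a token index into its binary digits, most significant bit first."""
    return [(index >> shift) & 1 for shift in range(num_bits - 1, -1, -1)]


def bits_to_index(bits: list[int]) -> int:
    """Rebuild the token index from its binary digits."""
    index = 0
    for bit in bits:
        index = (index << 1) | bit
    return index


def index_head_params(hidden_size: int, num_bits: int) -> int:
    """Weights in a linear layer that scores all 2**num_bits tokens at once."""
    return hidden_size * (2 ** num_bits)


def bitwise_head_params(hidden_size: int, num_bits: int) -> int:
    """Weights in a linear layer that emits two logits (0 or 1) for each bit."""
    return hidden_size * 2 * num_bits


if __name__ == "__main__":
    hidden_size, num_bits = 2048, 32  # hypothetical sizes, for illustration only
    print("index head weights:  ", index_head_params(hidden_size, num_bits))
    print("bitwise head weights:", bitwise_head_params(hidden_size, num_bits))
```

Under these assumed sizes, the per-index head grows exponentially in the number of bits while the bitwise head grows only linearly, which is the intuition behind scaling the vocabulary toward infinity.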