Infinity: Scaling Bitwise AutoRegressive Modeling for High-Resolution Image Synthesis

Abstract

We present Infinity, a Bitwise Visual AutoRegressive Modeling approach capable of generating high-resolution, photorealistic images that follow language instructions. Infinity redefines the visual autoregressive model under a bitwise token prediction framework with an infinite-vocabulary tokenizer and classifier and a bitwise self-correction mechanism, markedly improving generation capacity and detail. By theoretically scaling the tokenizer vocabulary size to infinity while concurrently scaling the transformer size, our method unlocks substantially stronger scaling capabilities than vanilla VAR. Infinity sets a new record for autoregressive text-to-image models, outperforming top-tier diffusion models such as SD3-Medium and SDXL. Notably, Infinity surpasses SD3-Medium by improving the GenEval benchmark score from 0.62 to 0.73 and the ImageReward benchmark score from 0.87 to 0.96, achieving a win rate of 66%. Without extra optimization, Infinity generates a high-quality 1024x1024 image in 0.8 seconds, making it 2.6x faster than SD3-Medium and establishing it as the fastest text-to-image model. Models and code will be released to promote further exploration of Infinity for visual generation and unified tokenizer modeling.
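The abstract does not spell out why predicting bits keeps a very large vocabulary tractable. As a rough illustration only, the sketch below assumes each visual token is a d-bit code (so the implied vocabulary has 2^d entries) and compares the weight count of a conventional softmax head over all 2^d indices with a head that uses one 2-way classifier per bit. The function names, the hidden size of 2048, and the one-classifier-per-bit layout are assumptions made for this example, not details taken from the text above; biases are ignored.

```python
def index_head_params(hidden_dim: int, bits: int) -> int:
    """Weights in a softmax head with one logit per vocabulary index (2**bits classes)."""
    return hidden_dim * (2 ** bits)


def bitwise_head_params(hidden_dim: int, bits: int) -> int:
    """Weights in a head with one 2-way classifier per bit (2 logits for each of `bits` bits)."""
    return hidden_dim * 2 * bits


def bits_to_index(bits) -> int:
    """Map a bit sequence (most significant bit first) to its vocabulary index."""
    index = 0
    for b in bits:
        index = (index << 1) | int(b)
    return index


if __name__ == "__main__":
    hidden_dim = 2048  # hypothetical transformer width, chosen only for illustration
    for d in (8, 16, 32):
        print(
            f"bits={d:2d}  vocab=2^{d}  "
            f"index head={index_head_params(hidden_dim, d):,}  "
            f"bitwise head={bitwise_head_params(hidden_dim, d):,}"
        )
```

Under these assumptions the index head grows exponentially with the number of bits while the bitwise head grows linearly, which is one way to read the claim that the vocabulary can be scaled "theoretically to infinity".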
