Randomized Autoregressive Visual Generation

Abstract

This paper presents Randomized AutoRegressive modeling (RAR) for visual generation, which sets a new state of the art on the image generation task while maintaining full compatibility with language modeling frameworks. The proposed RAR is simple: during a standard autoregressive training process with a next-token prediction objective, the input sequence, typically ordered in raster form, is randomly permuted into a different factorization order with probability r, where r starts at 1 and linearly decays to 0 over the course of training. This annealing strategy lets the model learn to maximize the expected likelihood over all factorization orders, which improves its ability to model bidirectional context. Importantly, RAR preserves the integrity of the autoregressive modeling framework, ensuring full compatibility with language modeling while significantly improving performance in image generation. On the ImageNet-256 benchmark, RAR achieves an FID score of 1.48, surpassing prior state-of-the-art autoregressive image generators and also outperforming leading diffusion-based and masked transformer-based methods. Code and models will be made available at https://github.com/bytedance/1d-tokenizer.
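The annealed permutation schedule described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function names `permutation_probability` and `choose_order` are hypothetical, and the sketch assumes that the linear decay spans the entire training run and that the random draw is made once per training sample. How the model consumes a permuted sequence is not covered here.

```python
import random


def permutation_probability(step: int, total_steps: int) -> float:
    """Linearly decay r from 1.0 at step 0 to 0.0 at total_steps.

    Assumption: the decay covers the whole run; the abstract only says
    r starts at 1 and linearly decays to 0 over the course of training.
    """
    if total_steps <= 0:
        raise ValueError("total_steps must be positive")
    return max(0.0, 1.0 - step / total_steps)


def choose_order(seq_len: int, step: int, total_steps: int,
                 rng: random.Random) -> list[int]:
    """Pick the token order for one training sample.

    With probability r the raster order is replaced by a uniformly
    random permutation; otherwise the raster order is kept.
    """
    order = list(range(seq_len))  # raster order: 0, 1, ..., seq_len - 1
    r = permutation_probability(step, total_steps)
    if rng.random() < r:
        rng.shuffle(order)  # in-place uniform shuffle
    return order


if __name__ == "__main__":
    rng = random.Random(0)
    # Early in training almost every sample is permuted;
    # by the final step every sample is in raster order.
    for step in (0, 500, 1000):
        r = permutation_probability(step, 1000)
        print(step, r, choose_order(8, step, 1000, rng))
```

Under these assumptions, training begins with fully random factorization orders and ends with plain raster-order next-token prediction.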
