Randomized Autoregressive Visual Generation
November 1, 2024
Authors: Qihang Yu, Ju He, Xueqing Deng, Xiaohui Shen, Liang-Chieh Chen
cs.AI
Abstract
This paper presents Randomized AutoRegressive modeling (RAR) for visual
generation, which sets a new state-of-the-art performance on the image
generation task while maintaining full compatibility with language modeling
frameworks. The proposed RAR is simple: during a standard autoregressive
training process with a next-token prediction objective, the input
sequence, typically ordered in raster form, is randomly permuted into different
factorization orders with a probability r, where r starts at 1 and linearly
decays to 0 over the course of training. This annealing training strategy
enables the model to learn to maximize the expected likelihood over all
factorization orders and thus effectively improve the model's capability of
modeling bidirectional contexts. Importantly, RAR preserves the integrity of
the autoregressive modeling framework, ensuring full compatibility with
language modeling while significantly improving performance in image
generation. On the ImageNet-256 benchmark, RAR achieves an FID score of 1.48,
not only surpassing prior state-of-the-art autoregressive image generators but
also outperforming leading diffusion-based and masked transformer-based
methods. Code and models will be made available at
https://github.com/bytedance/1d-tokenizer.
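The training recipe described in the abstract, decaying the permutation probability r linearly from 1 to 0 while keeping a standard next-token prediction objective, can be illustrated with a minimal sketch. The helper names below (permutation_probability, maybe_permute, next_token_loss) are hypothetical and are not taken from the released code; details such as how positional embeddings and prediction targets follow the permuted order are intentionally omitted.

```python
import torch

def permutation_probability(step: int, total_steps: int) -> float:
    """Permutation probability r: starts at 1 and decays linearly to 0
    over the course of training, as described in the abstract."""
    return max(0.0, 1.0 - step / total_steps)

def maybe_permute(tokens: torch.Tensor, r: float):
    """With probability r, replace the raster order by a random
    factorization order; otherwise keep the raster order.

    tokens: (batch, seq_len) discrete image tokens.
    Returns the reordered tokens and the order used, so that targets and
    positional information can be permuted consistently (not shown here).
    """
    seq_len = tokens.size(1)
    if torch.rand(()).item() < r:
        order = torch.randperm(seq_len, device=tokens.device)
    else:
        order = torch.arange(seq_len, device=tokens.device)
    return tokens[:, order], order

# Hypothetical training-loop fragment: the model is still trained with the
# usual autoregressive next-token prediction loss on the (possibly) permuted
# sequence.
# for step, tokens in enumerate(loader):
#     r = permutation_probability(step, total_steps)
#     permuted, order = maybe_permute(tokens, r)
#     loss = next_token_loss(model, permuted, order)  # placeholder
```

Because r reaches 0 by the end of training, the later training steps (and hence inference) use the standard raster order, which is consistent with the abstract's claim that RAR remains fully compatible with language modeling frameworks.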