Randomized Autoregressive Visual Generation
November 1, 2024
Authors: Qihang Yu, Ju He, Xueqing Deng, Xiaohui Shen, Liang-Chieh Chen
cs.AI
Abstract
This paper presents Randomized AutoRegressive modeling (RAR) for visual
generation, which sets a new state-of-the-art performance on the image
generation task while maintaining full compatibility with language modeling
frameworks. The proposed RAR is simple: during a standard autoregressive
training process with a next-token prediction objective, the input
sequence, typically ordered in raster form, is randomly permuted into different
factorization orders with a probability r, where r starts at 1 and linearly
decays to 0 over the course of training. This annealing training strategy
enables the model to learn to maximize the expected likelihood over all
factorization orders and thus effectively improve the model's capability of
modeling bidirectional contexts. Importantly, RAR preserves the integrity of
the autoregressive modeling framework, ensuring full compatibility with
language modeling while significantly improving performance in image
generation. On the ImageNet-256 benchmark, RAR achieves an FID score of 1.48,
not only surpassing prior state-of-the-art autoregressive image generators but
also outperforming leading diffusion-based and masked transformer-based
methods. Code and models will be made available at
https://github.com/bytedance/1d-tokenizer.
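The training recipe described in the abstract, decaying the permutation probability r linearly from 1 to 0 while keeping a standard next-token prediction objective, can be illustrated with a minimal sketch. The helper names below (permutation_probability, maybe_permute, next_token_loss) are hypothetical and are not taken from the released code; details such as how positional embeddings and prediction targets follow the permuted order are intentionally omitted.

```python
import torch

def permutation_probability(step: int, total_steps: int) -> float:
    """Permutation probability r: starts at 1 and decays linearly to 0
    over the course of training, as described in the abstract."""
    return max(0.0, 1.0 - step / total_steps)

def maybe_permute(tokens: torch.Tensor, r: float):
    """With probability r, replace the raster order by a random
    factorization order; otherwise keep the raster order.

    tokens: (batch, seq_len) discrete image tokens.
    Returns the reordered tokens and the order used, so that targets and
    positional information can be permuted consistently (not shown here).
    """
    seq_len = tokens.size(1)
    if torch.rand(()).item() < r:
        order = torch.randperm(seq_len, device=tokens.device)
    else:
        order = torch.arange(seq_len, device=tokens.device)
    return tokens[:, order], order

# Hypothetical training-loop fragment: the model is still trained with the
# usual autoregressive next-token prediction loss on the (possibly) permuted
# sequence.
# for step, tokens in enumerate(loader):
#     r = permutation_probability(step, total_steps)
#     permuted, order = maybe_permute(tokens, r)
#     loss = next_token_loss(model, permuted, order)  # placeholder
```

Because r reaches 0 by the end of training, the later training steps (and hence inference) use the standard raster order, which is consistent with the abstract's claim that RAR remains fully compatible with language modeling frameworks.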