

Autoregressive Image Generation with Randomized Parallel Decoding

March 13, 2025
Authors: Haopeng Li, Jinyue Yang, Guoqi Li, Huan Wang
cs.AI

Abstract

We introduce ARPG, a novel visual autoregressive model that enables randomized parallel generation, addressing the inherent limitations of conventional raster-order approaches, which hinder inference efficiency and zero-shot generalization due to their sequential, predefined token generation order. Our key insight is that effective random-order modeling necessitates explicit guidance for determining the position of the next predicted token. To this end, we propose a novel guided decoding framework that decouples positional guidance from content representation, encoding them separately as queries and key-value pairs. By directly incorporating this guidance into the causal attention mechanism, our approach enables fully random-order training and generation, eliminating the need for bidirectional attention. Consequently, ARPG readily generalizes to zero-shot tasks such as image inpainting, outpainting, and resolution expansion. Furthermore, it supports parallel inference by concurrently processing multiple queries using a shared KV cache. On the ImageNet-1K 256 benchmark, our approach attains an FID of 1.94 with only 64 sampling steps, achieving over a 20-fold increase in throughput while reducing memory consumption by over 75% compared to representative recent autoregressive models at a similar scale.
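
To make the mechanism concrete, below is a minimal sketch of the decoupled guided attention the abstract describes: queries are derived from embeddings of the randomly chosen target positions, while key-value pairs come from the content tokens generated so far. All names, shapes, and the single-head formulation are illustrative assumptions, not the authors' implementation (which uses a full Transformer stack and a shared KV cache).

```python
# Hypothetical, minimal sketch of decoupled guided attention: position
# queries attend over key-value pairs built from already generated content
# tokens. Names and shapes are illustrative assumptions only.
import torch
import torch.nn.functional as F

def guided_decode_step(pos_emb, ctx_tokens, Wq, Wk, Wv):
    """One parallel decoding step.

    pos_emb:    (B, M, D) embeddings of the M randomly chosen target
                positions (queries telling the model *where* to predict).
    ctx_tokens: (B, N, D) hidden states of the N tokens generated so far
                (content keys/values; in practice read from a KV cache).
    """
    q = pos_emb @ Wq                 # (B, M, D) position-derived queries
    k = ctx_tokens @ Wk              # (B, N, D) content keys
    v = ctx_tokens @ Wv              # (B, N, D) content values
    # Queries attend only to previously generated tokens, so no
    # bidirectional attention is needed; M positions decode in parallel.
    attn = F.softmax(q @ k.transpose(-2, -1) / q.shape[-1] ** 0.5, dim=-1)
    return attn @ v                  # (B, M, D) features for the M next tokens

# Usage: predict 4 randomly ordered tokens at once from 16 cached tokens.
B, N, M, D = 2, 16, 4, 64
Wq, Wk, Wv = (torch.randn(D, D) for _ in range(3))
out = guided_decode_step(torch.randn(B, M, D), torch.randn(B, N, D), Wq, Wk, Wv)
print(out.shape)  # torch.Size([2, 4, 64])
```

Because every query row attends over the same N cached context tokens, any number of target positions can be decoded in one step against shared key-value pairs, which is what enables the parallel inference and fully random generation order noted in the abstract.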

