Distilled Decoding 1: One-step Sampling of Image Auto-regressive Models with Flow Matching
December 22, 2024
Authors: Enshu Liu, Xuefei Ning, Yu Wang, Zinan Lin
cs.AI
Abstract
Autoregressive (AR) models have achieved state-of-the-art performance in text
and image generation but suffer from slow generation due to the token-by-token
process. We ask an ambitious question: can a pre-trained AR model be adapted to
generate outputs in just one or two steps? If successful, this would
significantly advance the development and deployment of AR models. We notice
that existing works that try to speed up AR generation by generating multiple
tokens at once fundamentally cannot capture the output distribution due to the
conditional dependencies between tokens, limiting their effectiveness for
few-step generation. To address this, we propose Distilled Decoding (DD), which
uses flow matching to create a deterministic mapping from Gaussian distribution
to the output distribution of the pre-trained AR model. We then train a network
to distill this mapping, enabling few-step generation. DD does not need the
training data of the original AR model, making it more practical. We evaluate DD
on state-of-the-art image AR models and present promising results on
ImageNet-256. For VAR, which requires 10-step generation, DD enables one-step
generation (6.3× speed-up), with an acceptable increase in FID from 4.19
to 9.96. For LlamaGen, DD reduces generation from 256 steps to 1, achieving a
217.8× speed-up with a comparable FID increase from 4.11 to 11.35. In
both cases, baseline methods completely fail with FID>100. DD also excels on
text-to-image generation, reducing the generation from 256 steps to 2 for
LlamaGen with minimal FID increase from 25.70 to 28.95. As the first work to
demonstrate the possibility of one-step generation for image AR models, DD
challenges the prevailing notion that AR models are inherently slow, and opens
up new opportunities for efficient AR generation. The project website is at
https://imagination-research.github.io/distilled-decoding.
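
To make the core idea concrete, here is a minimal PyTorch sketch of the distillation structure the abstract describes: a deterministic, multi-step flow-matching map from Gaussian noise to the pre-trained AR model's output is distilled into a student that jumps to the endpoint in one step. The `flow_velocity` interface, `teacher_trajectory`, and `distill_step` names are illustrative assumptions, not the paper's actual API; see the project website for the real implementation.

```python
# Minimal sketch (not the paper's code) of flow-matching distillation:
# a multi-step deterministic map from Gaussian noise to the AR model's
# output is distilled into a one-step student network.
import torch
import torch.nn as nn

def teacher_trajectory(ar_model, noise, num_steps=10):
    """Hypothetical teacher: integrate a flow-matching ODE that
    deterministically maps Gaussian noise to the AR model's output
    (e.g., a sequence of token embeddings)."""
    x = noise
    for i in range(num_steps):
        t = torch.full((noise.size(0),), i / num_steps, device=noise.device)
        v = ar_model.flow_velocity(x, t)  # assumed velocity-field interface
        x = x + v / num_steps             # Euler step along the deterministic ODE
    return x

def distill_step(student, ar_model, optimizer, batch_size, dim, device):
    """One training step: the student learns to map noise directly to the
    teacher's ODE endpoint in a single forward pass."""
    noise = torch.randn(batch_size, dim, device=device)
    with torch.no_grad():
        target = teacher_trajectory(ar_model, noise)  # multi-step teacher output
    pred = student(noise)                             # one-step student output
    loss = nn.functional.mse_loss(pred, target)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

Note that because the noise-to-output mapping is deterministic, the student can regress its endpoint directly from fresh Gaussian samples, which is why no training data from the original AR model is required.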