Distilled Decoding 1: One-step Sampling of Image Auto-regressive Models with Flow Matching
December 22, 2024
Authors: Enshu Liu, Xuefei Ning, Yu Wang, Zinan Lin
cs.AI
Abstract
Autoregressive (AR) models have achieved state-of-the-art performance in text
and image generation but suffer from slow generation due to the token-by-token
process. We ask an ambitious question: can a pre-trained AR model be adapted to
generate outputs in just one or two steps? If successful, this would
significantly advance the development and deployment of AR models. We notice
that existing works that try to speed up AR generation by generating multiple
tokens at once fundamentally cannot capture the output distribution due to the
conditional dependencies between tokens, limiting their effectiveness for
few-step generation. To address this, we propose Distilled Decoding (DD), which
uses flow matching to create a deterministic mapping from Gaussian distribution
to the output distribution of the pre-trained AR model. We then train a network
to distill this mapping, enabling few-step generation. DD does not need the
training data of the original AR model, making it more practical. We evaluate DD
on state-of-the-art image AR models and present promising results on
ImageNet-256. For VAR, which requires 10-step generation, DD enables one-step
generation (6.3× speed-up), with an acceptable increase in FID from 4.19
to 9.96. For LlamaGen, DD reduces generation from 256 steps to 1, achieving a
217.8× speed-up with a comparable FID increase from 4.11 to 11.35. In
both cases, baseline methods completely fail with FID>100. DD also excels on
text-to-image generation, reducing the generation from 256 steps to 2 for
LlamaGen with minimal FID increase from 25.70 to 28.95. As the first work to
demonstrate the possibility of one-step generation for image AR models, DD
challenges the prevailing notion that AR models are inherently slow, and opens
up new opportunities for efficient AR generation. The project website is at
https://imagination-research.github.io/distilled-decoding.
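
To make the core idea concrete, here is a minimal PyTorch sketch of the distillation structure the abstract describes: a deterministic, multi-step flow-matching map from Gaussian noise to the pre-trained AR model's output is distilled into a student that jumps to the endpoint in one step. The `flow_velocity` interface, `teacher_trajectory`, and `distill_step` names are illustrative assumptions, not the paper's actual API; see the project website for the real implementation.

```python
# Minimal sketch (not the paper's code) of flow-matching distillation:
# a multi-step deterministic map from Gaussian noise to the AR model's
# output is distilled into a one-step student network.
import torch
import torch.nn as nn

def teacher_trajectory(ar_model, noise, num_steps=10):
    """Hypothetical teacher: integrate a flow-matching ODE that
    deterministically maps Gaussian noise to the AR model's output
    (e.g., a sequence of token embeddings)."""
    x = noise
    for i in range(num_steps):
        t = torch.full((noise.size(0),), i / num_steps, device=noise.device)
        v = ar_model.flow_velocity(x, t)  # assumed velocity-field interface
        x = x + v / num_steps             # Euler step along the deterministic ODE
    return x

def distill_step(student, ar_model, optimizer, batch_size, dim, device):
    """One training step: the student learns to map noise directly to the
    teacher's ODE endpoint in a single forward pass."""
    noise = torch.randn(batch_size, dim, device=device)
    with torch.no_grad():
        target = teacher_trajectory(ar_model, noise)  # multi-step teacher output
    pred = student(noise)                             # one-step student output
    loss = nn.functional.mse_loss(pred, target)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

Note that because the noise-to-output mapping is deterministic, the student can regress its endpoint directly from fresh Gaussian samples, which is why no training data from the original AR model is required.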