Distilled Decoding 1: One-step Sampling of Image Auto-regressive Models with Flow Matching

Abstract

Autoregressive (AR) models have achieved state-of-the-art performance in text and image generation but suffer from slow generation because they produce output token by token. We ask an ambitious question: can a pre-trained AR model be adapted to generate outputs in just one or two steps? If successful, this would significantly advance the development and deployment of AR models. We observe that existing works that speed up AR generation by producing multiple tokens at once fundamentally cannot capture the output distribution, because of the conditional dependencies between tokens, which limits their effectiveness for few-step generation. To address this, we propose Distilled Decoding (DD), which uses flow matching to create a deterministic mapping from a Gaussian distribution to the output distribution of the pre-trained AR model. We then train a network to distill this mapping, enabling few-step generation. DD does not need the training data of the original AR model, making it more practical.

We evaluate DD on state-of-the-art image AR models and present promising results on ImageNet-256. For VAR, which requires 10-step generation, DD enables one-step generation (a 6.3× speed-up), with an acceptable increase in FID from 4.19 to 9.96. For LlamaGen, DD reduces generation from 256 steps to 1, achieving a 217.8× speed-up with a comparable FID increase from 4.11 to 11.35. In both cases, baseline methods completely fail, with FID > 100. DD also excels on text-to-image generation, reducing generation from 256 steps to 2 for LlamaGen with a minimal FID increase from 25.70 to 28.95. As the first work to demonstrate the possibility of one-step generation for image AR models, DD challenges the prevailing notion that AR models are inherently slow and opens up new opportunities for efficient AR generation. The project website is at https://imagination-research.github.io/distilled-decoding.
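To make the idea of a deterministic mapping from Gaussian noise to a token distribution more concrete, here is a minimal toy sketch. It is not the authors' implementation. It assumes a hypothetical 1-D codebook with two entries, a fixed categorical distribution standing in for one AR conditional, a linear interpolation path between noise and target, and plain Euler integration. The function names `velocity` and `noise_to_token` are invented for this illustration.

```python
import numpy as np


def velocity(x, t, codebook, probs):
    """Marginal flow-matching velocity at points x and time t (0 <= t < 1).

    Assumes the linear path x_t = (1 - t) * noise + t * target, where the
    target is a codebook entry c_i drawn with probability probs[i].
    """
    one_minus_t = 1.0 - t
    # Posterior weight of each codebook entry given the current position:
    # w_i is proportional to p_i * N(x; t * c_i, (1 - t)^2).
    logits = np.log(probs)[None, :] - (
        (x[:, None] - t * codebook[None, :]) ** 2 / (2.0 * one_minus_t**2)
    )
    logits -= logits.max(axis=1, keepdims=True)  # numerical stability
    weights = np.exp(logits)
    weights /= weights.sum(axis=1, keepdims=True)
    expected_target = weights @ codebook
    # E[target - noise | x_t = x] simplifies to (E[target | x] - x) / (1 - t).
    return (expected_target - x) / one_minus_t


def noise_to_token(noise, codebook, probs, num_steps=200):
    """Deterministically map each noise value to a token index via Euler steps."""
    x = np.asarray(noise, dtype=float).copy()
    dt = 1.0 / num_steps
    for k in range(num_steps):
        x = x + dt * velocity(x, k * dt, codebook, probs)
    # Snap the end point of each trajectory to the nearest codebook entry.
    return np.abs(x[:, None] - codebook[None, :]).argmin(axis=1)


# Hypothetical toy setup: two 1-D "token embeddings" with probabilities 0.3 / 0.7.
codebook = np.array([-1.0, 1.0])
probs = np.array([0.3, 0.7])

rng = np.random.default_rng(0)
noise = rng.standard_normal(20_000)
tokens = noise_to_token(noise, codebook, probs)

# The empirical frequencies should be close to `probs`.
freqs = np.bincount(tokens, minlength=len(codebook)) / len(tokens)
print("empirical token frequencies:", freqs)
```

The same noise value always yields the same token, while Gaussian noise as a whole reproduces the target probabilities approximately. In this toy setting a one-step generator would be a network trained to predict `noise_to_token(noise, ...)` directly from `noise`, replacing the 200 Euler steps with a single forward pass.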

