Distilled Decoding 1: One-step Sampling of Image Auto-regressive Models with Flow Matching

December 22, 2024
Authors: Enshu Liu, Xuefei Ning, Yu Wang, Zinan Lin
cs.AI

Abstract

Autoregressive (AR) models have achieved state-of-the-art performance in text and image generation but suffer from slow generation due to the token-by-token process. We ask an ambitious question: can a pre-trained AR model be adapted to generate outputs in just one or two steps? If successful, this would significantly advance the development and deployment of AR models. We notice that existing works that try to speed up AR generation by generating multiple tokens at once fundamentally cannot capture the output distribution, due to the conditional dependencies between tokens, limiting their effectiveness for few-step generation. To address this, we propose Distilled Decoding (DD), which uses flow matching to create a deterministic mapping from a Gaussian distribution to the output distribution of the pre-trained AR model. We then train a network to distill this mapping, enabling few-step generation. DD does not need the training data of the original AR model, making it more practical. We evaluate DD on state-of-the-art image AR models and present promising results on ImageNet-256. For VAR, which requires 10-step generation, DD enables one-step generation (6.3× speed-up), with an acceptable increase in FID from 4.19 to 9.96. For LlamaGen, DD reduces generation from 256 steps to 1, achieving a 217.8× speed-up with a comparable FID increase from 4.11 to 11.35. In both cases, baseline methods fail completely, with FID > 100. DD also excels at text-to-image generation, reducing generation from 256 steps to 2 for LlamaGen with a minimal FID increase from 25.70 to 28.95. As the first work to demonstrate the possibility of one-step generation for image AR models, DD challenges the prevailing notion that AR models are inherently slow, and opens up new opportunities for efficient AR generation. The project website is at https://imagination-research.github.io/distilled-decoding.
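To make the recipe in the abstract concrete, below is a minimal, self-contained PyTorch sketch of the two-stage idea, not the authors' implementation: a flow-matching ODE (integrated here with a few Euler steps) defines a deterministic mapping from Gaussian noise to the teacher's output sequence, and a student network is trained to reproduce that mapping in a single forward pass. The `TeacherAR` stub, the Euler solver, and all shapes and hyperparameters are illustrative assumptions; the actual DD models operate on discrete image tokens from VAR and LlamaGen.

```python
# Conceptual sketch of Distilled Decoding (NOT the paper's code).
# TeacherAR, the student architecture, shapes, and the Euler ODE solver
# are toy assumptions chosen to keep the example self-contained.
import torch
import torch.nn as nn

DIM, NUM_TOKENS = 16, 8  # toy token-embedding size and sequence length

class TeacherAR(nn.Module):
    """Stand-in for a pre-trained AR model exposing a flow-matching velocity
    field v(x, tau | context) for the next token's embedding."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(DIM + 1, 64), nn.SiLU(), nn.Linear(64, DIM))

    def velocity(self, x, tau, context):
        # context (previously generated tokens) is ignored in this toy stub.
        t = torch.full_like(x[:, :1], tau)
        return self.net(torch.cat([x, t], dim=-1))

def teacher_map(teacher, noise, num_ode_steps=8):
    """Deterministic noise -> sequence mapping: at each AR position, integrate
    the flow-matching ODE (Euler steps) from a Gaussian sample to a token."""
    tokens = []
    for pos in range(noise.shape[1]):
        x = noise[:, pos]
        for s in range(num_ode_steps):
            x = x + teacher.velocity(x, s / num_ode_steps, tokens) / num_ode_steps
        tokens.append(x)
    return torch.stack(tokens, dim=1)

# The student maps the ENTIRE noise tensor to the full sequence in one pass.
student = nn.Sequential(
    nn.Flatten(), nn.Linear(NUM_TOKENS * DIM, 256), nn.SiLU(),
    nn.Linear(256, NUM_TOKENS * DIM), nn.Unflatten(1, (NUM_TOKENS, DIM)),
)
teacher = TeacherAR()
opt = torch.optim.Adam(student.parameters(), lr=1e-3)

for step in range(100):  # distillation: match the teacher's deterministic mapping
    noise = torch.randn(32, NUM_TOKENS, DIM)
    with torch.no_grad():
        target = teacher_map(teacher, noise)  # endpoint of the teacher trajectory
    loss = nn.functional.mse_loss(student(noise), target)
    opt.zero_grad(); loss.backward(); opt.step()
```

At inference time only `student(noise)` runs, which is where the reported one- and two-step speed-ups over token-by-token decoding would come from.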
