Visual Generation Without Guidance
January 26, 2025
Authors: Huayu Chen, Kai Jiang, Kaiwen Zheng, Jianfei Chen, Hang Su, Jun Zhu
cs.AI
Abstract
Classifier-Free Guidance (CFG) has been a default technique in various visual
generative models, yet it requires inference from both conditional and
unconditional models during sampling. We propose to build visual models that
are free from guided sampling. The resulting algorithm, Guidance-Free Training
(GFT), matches the performance of CFG while reducing sampling to a single
model, halving the computational cost. Unlike previous distillation-based
approaches that rely on pretrained CFG networks, GFT enables training directly
from scratch. GFT is simple to implement. It retains the same maximum
likelihood objective as CFG and differs mainly in the parameterization of
conditional models. Implementing GFT requires only minimal modifications to
existing codebases, as most design choices and hyperparameters are directly
inherited from CFG. Our extensive experiments across five distinct visual
models demonstrate the effectiveness and versatility of GFT. Across domains of
diffusion, autoregressive, and masked-prediction modeling, GFT consistently
achieves comparable or even lower FID scores, with similar diversity-fidelity
trade-offs compared with CFG baselines, all while being guidance-free. Code
will be available at https://github.com/thu-ml/GFT.
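The computational saving the abstract describes can be illustrated with a minimal sketch. Below, `model` is a hypothetical stand-in for a network forward pass (not the paper's actual architecture); the point is only that standard CFG sampling needs two forward passes per step, combined as `eps_uncond + w * (eps_cond - eps_uncond)`, while a guidance-free model of the kind GFT trains needs one.

```python
def model(x, cond):
    # Hypothetical toy "network": in practice this would be a diffusion
    # model's noise prediction. Shown only to count forward passes.
    return 0.1 * x + (0.05 if cond is not None else 0.0)

def cfg_step(x, cond, w=3.0):
    """Classifier-Free Guidance: TWO forward passes per sampling step."""
    eps_cond = model(x, cond)      # conditional pass
    eps_uncond = model(x, None)    # unconditional pass
    # standard CFG combination with guidance scale w
    return eps_uncond + w * (eps_cond - eps_uncond)

def guidance_free_step(x, cond):
    """Guidance-free sampling (what GFT enables): ONE forward pass."""
    return model(x, cond)
```

With a guidance scale `w`, CFG interpolates (or extrapolates, for `w > 1`) between the unconditional and conditional predictions; GFT instead trains a single conditional model whose one-pass output already matches the guided distribution, halving per-step inference cost.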