Inference-Time Scaling for Diffusion Models beyond Scaling Denoising Steps
January 16, 2025
Authors: Nanye Ma, Shangyuan Tong, Haolin Jia, Hexiang Hu, Yu-Chuan Su, Mingda Zhang, Xuan Yang, Yandong Li, Tommi Jaakkola, Xuhui Jia, Saining Xie
cs.AI
Abstract
Generative models have made significant impacts across various domains,
largely due to their ability to scale during training by increasing data,
computational resources, and model size, a phenomenon characterized by
scaling laws. Recent research has begun to explore inference-time scaling
behavior in Large Language Models (LLMs), revealing how performance can further
improve with additional computation during inference. Unlike LLMs, diffusion
models inherently possess the flexibility to adjust inference-time computation
via the number of denoising steps, although the performance gains typically
flatten after a few dozen steps. In this work, we explore the inference-time scaling
behavior of diffusion models beyond increasing denoising steps and investigate
how the generation performance can further improve with increased computation.
Specifically, we consider a search problem aimed at identifying better noises
for the diffusion sampling process. We structure the design space along two
axes: the verifiers used to provide feedback, and the algorithms used to find
better noise candidates. Through extensive experiments on class-conditioned and
text-conditioned image generation benchmarks, our findings reveal that
increasing inference-time compute leads to substantial improvements in the
quality of samples generated by diffusion models, and that, given the complicated
nature of images, combinations of the components in the framework can be
specifically chosen to suit different application scenarios.
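
The search problem described in the abstract can be illustrated with a minimal sketch. The abstract does not name specific verifiers or search algorithms, so everything below is an assumption made for illustration: the algorithm is a simple best-of-N random search over initial noises, and `toy_sampler` and `toy_verifier` are hypothetical stand-ins for a full denoising run and a learned scoring model, respectively. It is not the authors' implementation.

```python
import random
from typing import Callable, List, Tuple

Noise = List[float]
Sample = List[float]


def sample_noise(dim: int, rng: random.Random) -> Noise:
    """Draw one candidate initial noise vector from a standard Gaussian."""
    return [rng.gauss(0.0, 1.0) for _ in range(dim)]


def toy_sampler(noise: Noise) -> Sample:
    """Hypothetical stand-in for a complete diffusion sampling run."""
    return [0.5 * z for z in noise]


def toy_verifier(sample: Sample) -> float:
    """Hypothetical stand-in for a verifier; higher scores are better."""
    return -sum((x - 1.0) ** 2 for x in sample)


def random_search(
    n_candidates: int,
    dim: int,
    sampler: Callable[[Noise], Sample],
    verifier: Callable[[Sample], float],
    seed: int = 0,
) -> Tuple[Noise, Sample, float]:
    """Best-of-N search: keep the noise whose sample the verifier scores highest."""
    if n_candidates < 1:
        raise ValueError("n_candidates must be at least 1")
    rng = random.Random(seed)
    best_noise: Noise = []
    best_sample: Sample = []
    best_score = float("-inf")
    for _ in range(n_candidates):
        noise = sample_noise(dim, rng)
        sample = sampler(noise)      # one full (toy) generation per candidate
        score = verifier(sample)     # feedback axis: the verifier
        if score > best_score:       # algorithm axis: keep the best so far
            best_noise, best_sample, best_score = noise, sample, score
    return best_noise, best_sample, best_score


if __name__ == "__main__":
    for n in (1, 4, 16, 64):
        _, _, score = random_search(n, dim=8, sampler=toy_sampler, verifier=toy_verifier)
        print(f"candidates={n:3d}  best verifier score={score:.3f}")
```

In this sketch, inference-time compute scales with `n_candidates`, since each candidate costs one sampler call plus one verifier call. With a fixed seed the candidates for a smaller budget are a prefix of those for a larger one, so the best verifier score can only stay the same or improve as the budget grows.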