Training Consistency Models with Variational Noise Coupling
February 25, 2025
Authors: Gianluigi Silvestri, Luca Ambrogioni, Chieh-Hsin Lai, Yuhta Takida, Yuki Mitsufuji
cs.AI
Abstract
Consistency Training (CT) has recently emerged as a promising alternative to
diffusion models, achieving competitive performance in image generation tasks.
However, non-distillation consistency training often suffers from high variance
and instability, and analyzing and improving its training dynamics is an active
area of research. In this work, we propose a novel CT training approach based
on the Flow Matching framework. Our main contribution is a trained
noise-coupling scheme inspired by the architecture of Variational Autoencoders
(VAE). By training a data-dependent noise emission model implemented as an
encoder architecture, our method can indirectly learn the geometry of the
noise-to-data mapping, which is instead fixed by the choice of the forward
process in classical CT. Empirical results across diverse image datasets show
significant generative improvements, with our model outperforming baselines and
achieving the state-of-the-art (SoTA) non-distillation CT FID on CIFAR-10, and
attaining FID on par with SoTA on ImageNet at 64×64 resolution in
2-step generation. Our code is available at https://github.com/sony/vct.
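The core idea above — replacing the independent noise draw of classical CT with a learned, data-dependent coupling — can be sketched as follows. A VAE-style encoder predicts a Gaussian over the noise, a reparameterized sample is plugged into the flow-matching interpolant, and a KL term keeps the marginal noise close to a standard Gaussian. This is a minimal illustrative sketch, not the authors' implementation: the encoder architecture, loss form, and the KL weight are all assumptions.

```python
import torch
import torch.nn as nn

class NoiseEncoder(nn.Module):
    """Hypothetical data-dependent noise emission model q(z|x)."""
    def __init__(self, dim: int):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(dim, 128), nn.SiLU(), nn.Linear(128, 2 * dim)
        )

    def forward(self, x):
        mu, logvar = self.net(x).chunk(2, dim=-1)
        z = mu + torch.randn_like(mu) * (0.5 * logvar).exp()  # reparameterization trick
        # KL(q(z|x) || N(0, I)) keeps the aggregate noise near a standard Gaussian
        kl = 0.5 * (mu.pow(2) + logvar.exp() - 1.0 - logvar).sum(-1).mean()
        return z, kl

def ct_loss(f_student, f_teacher, encoder, x, t, dt, kl_weight=0.01):
    """Consistency loss on the flow-matching interpolant x_t = (1-t)x + t*z,
    with the noise z coupled to the data via the encoder instead of drawn i.i.d.
    kl_weight is an assumed hyperparameter."""
    z, kl = encoder(x)
    t = t.unsqueeze(-1)
    x_t = (1 - t) * x + t * z                 # interpolant at time t
    x_s = (1 - (t - dt)) * x + (t - dt) * z   # same path, slightly earlier time
    with torch.no_grad():
        target = f_teacher(x_s, t - dt)       # stop-gradient / EMA teacher
    consistency = (f_student(x_t, t) - target).pow(2).mean()
    return consistency + kl_weight * kl
```

Because both interpolation points share the same coupled noise sample `z`, the consistency target is taken along a single learned noise-to-data path, which is the geometry the encoder is free to shape.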