Improved Training Technique for Latent Consistency Models
February 3, 2025
Authors: Quan Dao, Khanh Doan, Di Liu, Trung Le, Dimitris Metaxas
cs.AI
Abstract
Consistency models are a new family of generative models capable of producing
high-quality samples in either a single step or multiple steps. Recently,
consistency models have demonstrated impressive performance, achieving results
on par with diffusion models in the pixel space. However, the success of
scaling consistency training to large-scale datasets, particularly for
text-to-image and video generation tasks, is determined by performance in the
latent space. In this work, we analyze the statistical differences between
pixel and latent spaces, discovering that latent data often contains highly
impulsive outliers, which significantly degrade the performance of improved
consistency training (iCT) in the latent space. To address this, we replace
Pseudo-Huber losses with Cauchy
losses, effectively mitigating the impact of outliers. Additionally, we
introduce a diffusion loss at early timesteps and employ optimal transport (OT)
coupling to further enhance performance. Lastly, we introduce the adaptive
scaling-c scheduler to manage the robust training process and adopt
Non-scaling LayerNorm in the architecture to better capture the statistics of
the features and reduce outlier impact. With these strategies, we successfully
train latent consistency models capable of high-quality sampling with one or
two steps, significantly narrowing the performance gap between latent
consistency and diffusion models. The implementation is released here:
https://github.com/quandao10/sLCT/
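The loss replacement described above can be illustrated with a minimal sketch. This is not the paper's implementation (see the linked repository for that); it only contrasts the two robust losses on per-element residuals, with the constant c chosen arbitrarily here — the paper manages it with an adaptive scaling-c scheduler:

```python
import numpy as np

def pseudo_huber_loss(x, y, c):
    """Pseudo-Huber loss (used in iCT): quadratic near zero,
    but linear in the tails, so a large outlier residual still
    contributes a gradient of roughly constant magnitude."""
    return np.mean(np.sqrt((x - y) ** 2 + c ** 2) - c)

def cauchy_loss(x, y, c):
    """Cauchy (Lorentzian) loss: logarithmic tails. The gradient
    w.r.t. a residual r is 2r / (c**2 + r**2), which vanishes as
    r grows, so impulsive latent outliers are heavily down-weighted."""
    return np.mean(np.log1p(((x - y) / c) ** 2))

# One impulsive outlier among small residuals, mimicking latent data:
x = np.zeros(4)
y = np.array([0.1, 0.2, 0.1, 50.0])
print(pseudo_huber_loss(x, y, c=1.0))  # dominated by the outlier
print(cauchy_loss(x, y, c=1.0))        # outlier contributes only log-scale
```

The key design point is the tail behavior: both losses agree (up to scale) on small residuals, but the Cauchy loss bounds the outlier's gradient contribution, which is what makes consistency training stable on impulsive latent statistics.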