REPA-E: Unlocking VAE for End-to-End Tuning with Latent Diffusion Transformers
April 14, 2025
Authors: Xingjian Leng, Jaskirat Singh, Yunzhong Hou, Zhenchang Xing, Saining Xie, Liang Zheng
cs.AI
Abstract
In this paper we tackle a fundamental question: "Can we train latent
diffusion models together with the variational auto-encoder (VAE) tokenizer in
an end-to-end manner?" Traditional deep-learning wisdom dictates that
end-to-end training is often preferable when possible. However, for latent
diffusion transformers, jointly training the VAE and the diffusion model
end-to-end with the standard diffusion loss has been observed to be
ineffective, even degrading final performance. We show that while the diffusion loss is
ineffective, end-to-end training can be unlocked through the
representation-alignment (REPA) loss -- allowing both VAE and diffusion model
to be jointly tuned during the training process. Despite its simplicity, the
proposed training recipe (REPA-E) shows remarkable performance; speeding up
diffusion model training by over 17x and 45x over REPA and vanilla training
recipes, respectively. Interestingly, we observe that end-to-end tuning with
REPA-E also improves the VAE itself, leading to improved latent space structure
and downstream generation performance. In terms of final performance, our
approach sets a new state-of-the-art; achieving FID of 1.26 and 1.83 with and
without classifier-free guidance on ImageNet 256 x 256. Code is available at
https://end2end-diffusion.github.io.
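The abstract describes combining the standard diffusion (noise-prediction) loss with a representation-alignment (REPA) term, where it is the alignment term rather than the diffusion loss that makes joint VAE tuning effective. The following is a minimal numpy sketch of how such a combined objective might look; the function names, the cosine-similarity form of the alignment term, and the weighting coefficient `lam` are illustrative assumptions, not the paper's actual implementation.

```python
import numpy as np

def cosine_sim(a, b, eps=1e-8):
    # Row-wise cosine similarity between two (N, D) feature matrices.
    a_n = a / (np.linalg.norm(a, axis=-1, keepdims=True) + eps)
    b_n = b / (np.linalg.norm(b, axis=-1, keepdims=True) + eps)
    return (a_n * b_n).sum(axis=-1)

def repa_e_loss(pred_noise, true_noise, diff_feats, enc_feats, lam=0.5):
    """Sketch of a REPA-E-style combined objective (assumed form):
    standard diffusion MSE plus an alignment term that rewards cosine
    similarity between intermediate diffusion features and features from
    a pretrained visual encoder. In an actual end-to-end setup, only the
    alignment term would be allowed to backpropagate into the VAE, since
    the abstract reports the diffusion loss alone is ineffective there.
    """
    diffusion_loss = np.mean((pred_noise - true_noise) ** 2)
    repa_loss = -np.mean(cosine_sim(diff_feats, enc_feats))  # maximize similarity
    return diffusion_loss + lam * repa_loss
```

When the diffusion features already match the encoder features exactly, the alignment term reaches its minimum of -1, so the combined loss reduces to the diffusion MSE minus `lam`.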