REPA-E: Unlocking VAE for End-to-End Tuning with Latent Diffusion Transformers

April 14, 2025
Authors: Xingjian Leng, Jaskirat Singh, Yunzhong Hou, Zhenchang Xing, Saining Xie, Liang Zheng
cs.AI

Abstract

In this paper we tackle a fundamental question: "Can we train latent diffusion models together with the variational auto-encoder (VAE) tokenizer in an end-to-end manner?" Traditional deep-learning wisdom dictates that end-to-end training is often preferable when possible. However, for latent diffusion transformers, it is observed that training both the VAE and the diffusion model end-to-end with the standard diffusion loss is ineffective, even causing a degradation in final performance. We show that while the diffusion loss is ineffective, end-to-end training can be unlocked through the representation-alignment (REPA) loss, allowing both the VAE and the diffusion model to be jointly tuned during training. Despite its simplicity, the proposed training recipe (REPA-E) shows remarkable performance, speeding up diffusion model training by over 17x and 45x compared to the REPA and vanilla training recipes, respectively. Interestingly, we observe that end-to-end tuning with REPA-E also improves the VAE itself, leading to an improved latent space structure and better downstream generation performance. In terms of final performance, our approach sets a new state of the art, achieving FID scores of 1.26 and 1.83 with and without classifier-free guidance on ImageNet 256 x 256. Code is available at https://end2end-diffusion.github.io.
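To make the recipe concrete, here is a minimal PyTorch-style sketch of what one REPA-E training step could look like. It is an illustration inferred from the abstract, not the authors' released code (available at https://end2end-diffusion.github.io): the module interfaces (`vae.encode`, `diffusion_model(..., return_hidden=True)`, the frozen `encoder`, the `proj` head), the toy noise schedule, and the loss weight `lambda_repa` are all assumptions, and the stop-gradient placement reflects one reading of the abstract's claim that the raw diffusion loss degrades the VAE while the REPA loss provides the useful end-to-end signal.

```python
import torch
import torch.nn.functional as F

def repa_e_step(x, vae, diffusion_model, proj, encoder, opt,
                lambda_repa=0.5, num_steps=1000):
    """One hypothetical joint update of the VAE and the diffusion transformer."""
    z = vae.encode(x)  # latents; gradients may flow back into the VAE

    # Sample timesteps and noise the latents with a toy linear schedule
    # (the actual schedule is whatever the paper's diffusion setup uses).
    t = torch.randint(0, num_steps, (x.size(0),), device=x.device)
    a = torch.linspace(0.9999, 0.02, num_steps, device=x.device)[t]
    a = a.view(-1, *([1] * (z.dim() - 1)))
    noise = torch.randn_like(z)

    # Diffusion-loss branch: z is detached so the standard diffusion loss,
    # which the paper finds degrades the VAE, does not update it.
    pred = diffusion_model(a.sqrt() * z.detach() + (1 - a).sqrt() * noise, t)
    loss_diff = F.mse_loss(pred, noise)

    # REPA branch: align intermediate transformer features with a frozen
    # pretrained visual encoder; this loss does back-propagate into the VAE.
    _, hidden = diffusion_model(a.sqrt() * z + (1 - a).sqrt() * noise, t,
                                return_hidden=True)
    with torch.no_grad():
        target = encoder(x)  # frozen self-supervised features
    loss_repa = -F.cosine_similarity(proj(hidden), target, dim=-1).mean()

    loss = loss_diff + lambda_repa * loss_repa
    opt.zero_grad()
    loss.backward()
    opt.step()
    return loss_diff.item(), loss_repa.item()
```

A single optimizer over the VAE, the transformer, and the projection head keeps the sketch short; the choice of alignment layer, encoder, loss weight, and any per-component learning rates should follow the paper rather than this sketch.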
