TransPixar: Advancing Text-to-Video Generation with Transparency

January 6, 2025
Authors: Luozhou Wang, Yijun Li, Zhifei Chen, Jui-Hsien Wang, Zhifei Zhang, He Zhang, Zhe Lin, Yingcong Chen
cs.AI

Abstract

Text-to-video generative models have made significant strides, enabling diverse applications in entertainment, advertising, and education. However, generating RGBA video, which includes alpha channels for transparency, remains a challenge due to limited datasets and the difficulty of adapting existing models. Alpha channels are crucial for visual effects (VFX), allowing transparent elements like smoke and reflections to blend seamlessly into scenes. We introduce TransPixar, a method to extend pretrained video models for RGBA generation while retaining the original RGB capabilities. TransPixar leverages a diffusion transformer (DiT) architecture, incorporating alpha-specific tokens and using LoRA-based fine-tuning to jointly generate RGB and alpha channels with high consistency. By optimizing attention mechanisms, TransPixar preserves the strengths of the original RGB model and achieves strong alignment between RGB and alpha channels despite limited training data. Our approach effectively generates diverse and consistent RGBA videos, advancing the possibilities for VFX and interactive content creation.
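To make the mechanism described in the abstract more concrete, below is a minimal, illustrative PyTorch sketch of the general idea: alpha-specific tokens are appended to the RGB token sequence of a DiT-style attention block, and only a low-rank (LoRA) adapter on the attention input projection is trained, so the pretrained RGB weights stay frozen. This is not the authors' implementation; all class names, dimensions, and the LoRA rank here are illustrative assumptions.

```python
# Illustrative sketch (not the TransPixar codebase): joint attention over
# [text | RGB | alpha] tokens with a LoRA adapter, keeping base weights frozen.
import torch
import torch.nn as nn


class LoRALinear(nn.Module):
    """Frozen base projection plus a low-rank trainable update (LoRA)."""

    def __init__(self, base: nn.Linear, rank: int = 8, scale: float = 1.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False          # keep pretrained RGB weights intact
        self.down = nn.Linear(base.in_features, rank, bias=False)
        self.up = nn.Linear(rank, base.out_features, bias=False)
        nn.init.zeros_(self.up.weight)       # start as an identity update
        self.scale = scale

    def forward(self, x):
        return self.base(x) + self.scale * self.up(self.down(x))


class JointRGBAlphaAttention(nn.Module):
    """Self-attention over the concatenated [text | RGB | alpha] token sequence."""

    def __init__(self, dim: int, heads: int = 8, lora_rank: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        # LoRA-adapt the input projection that also sees the new alpha tokens.
        self.adapter = LoRALinear(nn.Linear(dim, dim), rank=lora_rank)

    def forward(self, text_tok, rgb_tok, alpha_tok):
        # Alpha tokens are appended so they attend to (and are attended by) the
        # RGB tokens, which is what encourages RGB/alpha alignment.
        seq = torch.cat([text_tok, rgb_tok, alpha_tok], dim=1)
        seq = self.adapter(seq)
        out, _ = self.attn(seq, seq, seq)
        n_text, n_rgb = text_tok.shape[1], rgb_tok.shape[1]
        return out[:, n_text:n_text + n_rgb], out[:, n_text + n_rgb:]


# Toy usage: batch of 2, 16 text tokens, 64 RGB tokens, 64 alpha tokens, dim 128.
text = torch.randn(2, 16, 128)
rgb = torch.randn(2, 64, 128)
alpha = torch.randn(2, 64, 128)
rgb_out, alpha_out = JointRGBAlphaAttention(128)(text, rgb, alpha)
print(rgb_out.shape, alpha_out.shape)  # torch.Size([2, 64, 128]) twice
```

In this sketch, only the LoRA adapter's parameters are trainable, mirroring the abstract's point that the original RGB capabilities are preserved while the alpha channel is learned from limited data.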
