TransPixar: Advancing Text-to-Video Generation with Transparency
January 6, 2025
Authors: Luozhou Wang, Yijun Li, Zhifei Chen, Jui-Hsien Wang, Zhifei Zhang, He Zhang, Zhe Lin, Yingcong Chen
cs.AI
Abstract
Text-to-video generative models have made significant strides, enabling
diverse applications in entertainment, advertising, and education. However,
generating RGBA video, which includes alpha channels for transparency, remains
a challenge due to limited datasets and the difficulty of adapting existing
models. Alpha channels are crucial for visual effects (VFX), allowing
transparent elements like smoke and reflections to blend seamlessly into
scenes. We introduce TransPixar, a method to extend pretrained video models for
RGBA generation while retaining the original RGB capabilities. TransPixar
leverages a diffusion transformer (DiT) architecture, incorporating
alpha-specific tokens and using LoRA-based fine-tuning to jointly generate RGB
and alpha channels with high consistency. By optimizing attention mechanisms,
TransPixar preserves the strengths of the original RGB model and achieves
strong alignment between RGB and alpha channels despite limited training data.
Our approach effectively generates diverse and consistent RGBA videos,
advancing the possibilities for VFX and interactive content creation.
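
To make concrete why an alpha channel lets elements such as smoke blend into a scene, the sketch below shows the standard "over" compositing operation for a single pixel. It is illustrative only and is not part of TransPixar: the function name `composite_over` and the sample pixel values are assumptions made for this example, and it assumes straight (non-premultiplied) alpha with all channel values in [0, 1].

```python
def composite_over(fg_rgba, bg_rgb):
    """Blend one straight-alpha RGBA pixel over an opaque RGB background pixel.

    Each output channel is alpha * foreground + (1 - alpha) * background.
    """
    r, g, b, a = fg_rgba
    if not 0.0 <= a <= 1.0:
        raise ValueError("alpha must be in [0, 1]")
    return tuple(a * f + (1.0 - a) * bg for f, bg in zip((r, g, b), bg_rgb))


# Hypothetical values: a light-grey smoke pixel at 25% opacity over a blue sky pixel.
smoke = (0.8, 0.8, 0.8, 0.25)
sky = (0.2, 0.4, 0.8)
blended = composite_over(smoke, sky)
print(blended)
```

With an alpha of 1 the foreground fully replaces the background, and with an alpha of 0 the background is unchanged. This is why RGB and alpha need to stay aligned: a misaligned alpha mask would blend the wrong pixels.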