CustomVideoX: 3D Reference Attention Driven Dynamic Adaptation for Zero-Shot Customized Video Diffusion Transformers
February 10, 2025
Authors: D. She, Mushui Liu, Jingxuan Pang, Jin Wang, Zhen Yang, Wanggui He, Guanghao Zhang, Yi Wang, Qihan Huang, Haobin Tang, Yunlong Yu, Siming Fu
cs.AI
Abstract
Customized generation has achieved significant progress in image synthesis,
yet personalized video generation remains challenging due to temporal
inconsistencies and quality degradation. In this paper, we introduce
CustomVideoX, an innovative framework leveraging the video diffusion
transformer for personalized video generation from a reference image.
CustomVideoX capitalizes on pre-trained video networks by exclusively training
the LoRA parameters to extract reference features, ensuring both efficiency and
adaptability. To facilitate seamless interaction between the reference image
and video content, we propose 3D Reference Attention, which enables direct and
simultaneous engagement of reference image features with all video frames
across spatial and temporal dimensions. To mitigate the excessive influence of
reference image features and textual guidance on generated video content during
inference, we implement the Time-Aware Reference Attention Bias (TAB) strategy,
dynamically modulating reference bias over different time steps. Additionally,
we introduce the Entity Region-Aware Enhancement (ERAE) module, aligning highly
activated regions of key entity tokens with reference feature injection by
adjusting attention bias. To thoroughly evaluate personalized video generation,
we establish a new benchmark, VideoBench, comprising over 50 objects and 100
prompts for extensive assessment. Experimental results show that CustomVideoX
significantly outperforms existing methods in terms of video consistency and
quality.
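
The idea of letting reference image features interact with every video frame at once, with a bias that changes over denoising steps, can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not give the attention equations or the bias schedule, so the identity query/key/value projections, the additive logit bias on reference tokens, the linear schedule, and the names `linear_bias` and `attend_with_reference` are all assumptions made for illustration.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def linear_bias(step, num_steps, start_bias, end_bias):
    """Hypothetical time-dependent bias: linear interpolation between
    start_bias (first step) and end_bias (last step). The schedule used
    in the paper is not stated in the abstract."""
    frac = step / max(num_steps - 1, 1)
    return start_bias + frac * (end_bias - start_bias)


def attend_with_reference(video_tokens, ref_tokens, ref_bias):
    """Single-head attention in which every video token attends jointly to
    all video tokens (flattened over frame, height, width) and to all
    reference-image tokens.

    Assumptions: queries, keys and values are the raw tokens (no learned
    projections), and ref_bias is added to the logits of reference keys.
    Returns (outputs, ref_mass), where ref_mass[i] is the total attention
    weight that video token i places on the reference tokens.
    """
    keys = video_tokens + ref_tokens
    n_video = len(video_tokens)
    dim = len(video_tokens[0])
    scale = 1.0 / math.sqrt(dim)

    outputs, ref_mass = [], []
    for q in video_tokens:
        logits = []
        for j, k in enumerate(keys):
            score = scale * sum(a * b for a, b in zip(q, k))
            if j >= n_video:  # this key comes from the reference image
                score += ref_bias
            logits.append(score)
        weights = softmax(logits)
        outputs.append(
            [sum(weights[j] * keys[j][c] for j in range(len(keys)))
             for c in range(dim)]
        )
        ref_mass.append(sum(weights[n_video:]))
    return outputs, ref_mass
```

In this sketch, a more negative `ref_bias` lowers the share of attention each video token gives to the reference image, so varying the bias with `linear_bias(step, num_steps, ...)` gives one simple way to modulate reference influence across time steps.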