Phantom: Subject-consistent video generation via cross-modal alignment
February 16, 2025
Authors: Lijie Liu, Tianxiang Ma, Bingchuan Li, Zhuowei Chen, Jiawei Liu, Qian He, Xinglong Wu
cs.AI
Abstract
The continued development of foundation models for video generation is expanding into diverse applications, yet subject-consistent video generation remains at an exploratory stage. We refer to this task as Subject-to-Video: extracting subject elements from reference images and generating subject-consistent video guided by textual instructions. We believe that the
essence of subject-to-video lies in balancing the dual-modal prompts of text
and image, thereby deeply and simultaneously aligning both text and visual
content. To this end, we propose Phantom, a unified video generation framework
for both single and multi-subject references. Building on existing
text-to-video and image-to-video architectures, we redesign the joint
text-image injection model and drive it to learn cross-modal alignment via
text-image-video triplet data. In particular, we emphasize subject consistency in human generation, subsuming existing ID-preserving video generation while delivering stronger performance. The project homepage is at https://phantom-video.github.io/Phantom/.
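The abstract's "joint text-image injection" can be pictured as conditioning video latents on a combined sequence of text and reference-image tokens. The sketch below is a minimal, hypothetical illustration of that idea, not the paper's actual architecture: the module names, dimensions, and the use of a single residual cross-attention layer are all assumptions for clarity.

```python
import torch
import torch.nn as nn

class JointTextImageInjection(nn.Module):
    """Hypothetical sketch of joint text-image conditioning: project both
    modalities into the video latent space, concatenate them into one
    conditioning sequence, and inject via residual cross-attention.
    Dimensions and structure are illustrative, not from the paper."""

    def __init__(self, video_dim=64, text_dim=32, image_dim=48, heads=4):
        super().__init__()
        self.text_proj = nn.Linear(text_dim, video_dim)
        self.image_proj = nn.Linear(image_dim, video_dim)
        self.cross_attn = nn.MultiheadAttention(video_dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(video_dim)

    def forward(self, video_tokens, text_tokens, image_tokens):
        # Build one joint conditioning sequence from both prompt modalities,
        # so text and subject-image cues compete in the same attention pass.
        cond = torch.cat([self.text_proj(text_tokens),
                          self.image_proj(image_tokens)], dim=1)
        # Video latents attend to the combined text+image context.
        attended, _ = self.cross_attn(self.norm(video_tokens), cond, cond)
        return video_tokens + attended  # residual injection

# Toy usage: batch of 2, 16 video tokens, 8 text tokens, 4 subject-image tokens.
block = JointTextImageInjection()
video = torch.randn(2, 16, 64)
text = torch.randn(2, 8, 32)
image = torch.randn(2, 4, 48)
out = block(video, text, image)
print(out.shape)  # torch.Size([2, 16, 64])
```

Feeding both modalities through a single attention pass (rather than separate text and image branches) is one simple way to realize the "deeply and simultaneously" aligned dual-modal prompting the abstract describes.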