

Phantom: Subject-consistent video generation via cross-modal alignment

February 16, 2025
作者: Lijie Liu, Tianxiang Ma, Bingchuan Li, Zhuowei Chen, Jiawei Liu, Qian He, Xinglong Wu
cs.AI

Abstract

Foundational models for video generation are steadily extending into diverse applications, while subject-consistent video generation remains at an exploratory stage. We refer to this task as Subject-to-Video: it extracts subject elements from reference images and generates subject-consistent video guided by textual instructions. We argue that the essence of subject-to-video lies in balancing the dual-modal prompts of text and image, thereby deeply and simultaneously aligning the textual and visual content. To this end, we propose Phantom, a unified video generation framework for both single- and multi-subject references. Building on existing text-to-video and image-to-video architectures, we redesign the joint text-image injection model and drive it to learn cross-modal alignment via text-image-video triplet data. In particular, we emphasize subject consistency in human generation, covering existing ID-preserving video generation while offering improved performance. The project homepage is at https://phantom-video.github.io/Phantom/.

