

Phantom: Subject-consistent video generation via cross-modal alignment

February 16, 2025
作者: Lijie Liu, Tianxiang Ma, Bingchuan Li, Zhuowei Chen, Jiawei Liu, Qian He, Xinglong Wu
cs.AI

Abstract

Foundational models for video generation are steadily extending into diverse applications, while subject-consistent video generation remains at an exploratory stage. We refer to this task as Subject-to-Video: it extracts subject elements from reference images and generates subject-consistent video guided by textual instructions. We argue that the essence of subject-to-video lies in balancing the dual-modal prompts of text and image, thereby deeply and simultaneously aligning the textual and visual content. To this end, we propose Phantom, a unified video generation framework for both single- and multi-subject references. Building on existing text-to-video and image-to-video architectures, we redesign the joint text-image injection model and drive it to learn cross-modal alignment via text-image-video triplet data. In particular, we emphasize subject consistency in human generation, covering existing ID-preserving video generation while offering improved performance. The project homepage is at https://phantom-video.github.io/Phantom/.

