Phantom: Subject-consistent video generation via cross-modal alignment

Abstract

Foundation models for video generation continue to advance and are expanding into a range of applications, yet subject-consistent video generation remains at an exploratory stage. We refer to this task as Subject-to-Video: it extracts subject elements from reference images and generates subject-consistent video guided by textual instructions. We believe that the essence of subject-to-video lies in balancing the dual-modal prompts of text and image, thereby aligning text and visual content deeply and simultaneously. To this end, we propose Phantom, a unified video generation framework for both single- and multi-subject references. Building on existing text-to-video and image-to-video architectures, we redesign the joint text-image injection model and drive it to learn cross-modal alignment from text-image-video triplet data. In particular, we emphasize subject consistency in human generation, which covers existing ID-preserving video generation while offering enhanced advantages. The project homepage is https://phantom-video.github.io/Phantom/.

