DreamVideo-2: Zero-Shot Subject-Driven Video Customization with Precise Motion Control

Abstract

Recent advances in customized video generation have enabled users to create videos tailored to both specific subjects and motion trajectories. However, existing methods often require complicated test-time fine-tuning and struggle with balancing subject learning and motion control, limiting their real-world applications. In this paper, we present DreamVideo-2, a zero-shot video customization framework capable of generating videos with a specific subject and motion trajectory, guided by a single image and a bounding box sequence, respectively, and without the need for test-time fine-tuning. Specifically, we introduce reference attention, which leverages the model's inherent capabilities for subject learning, and devise a mask-guided motion module to achieve precise motion control by fully utilizing the robust motion signal of box masks derived from bounding boxes. While these two components achieve their intended functions, we empirically observe that motion control tends to dominate over subject learning. To address this, we propose two key designs: 1) masked reference attention, which integrates a blended latent mask modeling scheme into reference attention to enhance subject representations at the desired positions, and 2) a reweighted diffusion loss, which differentiates the contributions of regions inside and outside the bounding boxes to ensure a balance between subject and motion control. Extensive experimental results on a newly curated dataset demonstrate that DreamVideo-2 outperforms state-of-the-art methods in both subject customization and motion control. The dataset, code, and models will be made publicly available.
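The abstract gives no formulas, so the following is only an illustrative sketch of two ideas it names: turning a bounding box sequence into per-frame binary box masks, and weighting a squared-error loss differently inside and outside those boxes. The function names, the pixel-space nested-list layout, and the weights `w_in` and `w_out` are all assumptions made for illustration; they are not taken from the paper, whose loss presumably operates on latent-space predictions inside a diffusion model.

```python
from typing import List, Tuple

Box = Tuple[int, int, int, int]  # (x0, y0, x1, y1), half-open pixel coordinates


def box_mask(box: Box, height: int, width: int) -> List[List[float]]:
    """Return a binary mask: 1.0 inside the box, 0.0 outside."""
    x0, y0, x1, y1 = box
    return [
        [1.0 if (x0 <= x < x1 and y0 <= y < y1) else 0.0 for x in range(width)]
        for y in range(height)
    ]


def reweighted_mse(pred, target, masks, w_in: float = 2.0, w_out: float = 1.0) -> float:
    """Mean squared error with a per-pixel weight selected by the box mask.

    pred, target, masks: nested lists shaped [frames][height][width].
    w_in and w_out are hypothetical weights, not values from the paper.
    """
    total, count = 0.0, 0
    for p_frame, t_frame, m_frame in zip(pred, target, masks):
        for p_row, t_row, m_row in zip(p_frame, t_frame, m_frame):
            for p, t, m in zip(p_row, t_row, m_row):
                weight = w_in * m + w_out * (1.0 - m)
                total += weight * (p - t) ** 2
                count += 1
    return total / count


# One mask per frame, built from a two-frame bounding box sequence.
boxes = [(0, 0, 2, 2), (1, 1, 3, 3)]
masks = [box_mask(b, height=4, width=4) for b in boxes]
```

Setting `w_in` above `w_out` makes errors inside the boxes count for more, which is one simple way to keep the subject region from being neglected when the motion signal is strong.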

