Video Creation by Demonstration

Abstract

We explore a novel video creation experience, namely Video Creation by Demonstration. Given a demonstration video and a context image from a different scene, we generate a physically plausible video that continues naturally from the context image and carries out the action concepts shown in the demonstration. To enable this capability, we present delta-Diffusion, a self-supervised training approach that learns from unlabeled videos through conditional future frame prediction. Unlike most existing video generation controls, which rely on explicit signals, we adopt a form of implicit latent control to provide the flexibility and expressiveness that general videos require. By placing an appearance bottleneck design on top of a video foundation model, we extract action latents from demonstration videos and use them to condition the generation process with minimal appearance leakage. Empirically, delta-Diffusion outperforms related baselines in both human preference and large-scale machine evaluations, and shows potential for interactive world simulation. Sample video generation results are available at https://delta-diffusion.github.io/.
