

MIMO: Controllable Character Video Synthesis with Spatial Decomposed Modeling

September 24, 2024
Authors: Yifang Men, Yuan Yao, Miaomiao Cui, Liefeng Bo
cs.AI

Abstract

Character video synthesis aims to produce realistic videos of animatable characters within lifelike scenes. As a fundamental problem in the computer vision and graphics communities, 3D approaches typically require multi-view captures for per-case training, which severely limits their applicability to modeling arbitrary characters in a short time. Recent 2D methods break this limitation via pre-trained diffusion models, but they struggle with pose generality and scene interaction. To this end, we propose MIMO, a novel framework that can not only synthesize character videos with controllable attributes (i.e., character, motion, and scene) provided by simple user inputs, but also simultaneously achieve advanced scalability to arbitrary characters, generality to novel 3D motions, and applicability to interactive real-world scenes, all within a unified framework. The core idea is to encode the 2D video into compact spatial codes, accounting for the inherent 3D nature of video content. Concretely, we lift the 2D frame pixels into 3D using monocular depth estimators and decompose the video clip into three spatial components (i.e., main human, underlying scene, and floating occlusion) arranged in hierarchical layers based on 3D depth. These components are further encoded into a canonical identity code, a structured motion code, and a full scene code, which serve as control signals for the synthesis process. This spatially decomposed modeling enables flexible user control, complex motion expression, and 3D-aware synthesis for scene interactions. Experimental results demonstrate the effectiveness and robustness of the proposed method.
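
The depth-based layering described in the abstract can be illustrated with a short sketch. The following is a minimal, hypothetical example, assuming a per-frame depth map (e.g., from an off-the-shelf monocular depth estimator, with larger values meaning closer to the camera) and a human segmentation mask are already available; the function names and the median-depth thresholding rule are illustrative assumptions, not details taken from the paper.

```python
import numpy as np

def decompose_frame(depth, human_mask):
    """Split one video frame into three spatial layers by relative depth.

    depth:      (H, W) float map from a monocular depth estimator,
                where larger values mean closer to the camera (assumed).
    human_mask: (H, W) bool mask of the main human, e.g., from an
                off-the-shelf segmentation model (assumed available).

    Returns a dict of three boolean masks: 'human', 'occlusion'
    (non-human pixels floating in front of the character), and
    'scene' (everything behind the character).
    """
    # Reference depth of the main character: median over the human region.
    human_depth = np.median(depth[human_mask])

    non_human = ~human_mask
    # Non-human pixels closer than the character are treated as
    # floating occlusions; the remaining pixels form the scene layer.
    occlusion_mask = non_human & (depth > human_depth)
    scene_mask = non_human & ~occlusion_mask

    return {"human": human_mask,
            "occlusion": occlusion_mask,
            "scene": scene_mask}

def extract_layer(frame, mask):
    """Return an RGBA layer of the frame keeping only the masked pixels."""
    h, w, _ = frame.shape
    layer = np.zeros((h, w, 4), dtype=np.uint8)
    layer[..., :3] = frame          # copy RGB content
    layer[..., 3] = mask * 255      # alpha channel from the boolean mask
    return layer
```

This sketch covers only the geometric split; in MIMO the resulting layers are then encoded separately into the canonical identity, structured motion, and full scene codes that condition the synthesis process.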

