Through-The-Mask: Mask-based Motion Trajectories for Image-to-Video Generation
January 6, 2025
作者: Guy Yariv, Yuval Kirstain, Amit Zohar, Shelly Sheynin, Yaniv Taigman, Yossi Adi, Sagie Benaim, Adam Polyak
cs.AI
Abstract
We consider the task of Image-to-Video (I2V) generation, which involves
transforming static images into realistic video sequences based on a textual
description. While recent advancements produce photorealistic outputs, they
frequently struggle to create videos with accurate and consistent object
motion, especially in multi-object scenarios. To address these limitations, we
propose a two-stage compositional framework that decomposes I2V generation
into: (i) an explicit intermediate representation generation stage, followed by
(ii) a video generation stage that is conditioned on this representation. Our
key innovation is the introduction of a mask-based motion trajectory as an
intermediate representation that captures both semantic object information and
motion, enabling an expressive yet compact representation of motion and
semantics. To incorporate the learned representation in the second stage, we
utilize object-level attention objectives. Specifically, we consider a spatial,
per-object, masked cross-attention objective, which integrates object-specific
prompts into their corresponding latent space regions, and a masked spatio-temporal
self-attention objective, which ensures frame-to-frame consistency for each object.
We evaluate our method on challenging benchmarks with multi-object and
high-motion scenarios and empirically demonstrate that the proposed method
achieves state-of-the-art results in temporal coherence, motion realism, and
text-prompt faithfulness. Additionally, we introduce \benchmark, a new
challenging benchmark for single-object and multi-object I2V generation, and
demonstrate our method's superiority on this benchmark. Project page is
available at https://guyyariv.github.io/TTM/.
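
To make the two object-level attention objectives more concrete, the sketch below gives one plausible PyTorch reading of them. All function names, tensor shapes, and the shared projection matrices are illustrative assumptions for exposition, not the authors' implementation.

```python
# Hedged sketch of the two object-level attention objectives, assuming a
# latent video model with flattened spatial tokens. Names and shapes are
# assumptions, not the paper's code.
import torch

def masked_cross_attention(latents, prompt_embs, object_masks, w_q, w_k, w_v):
    """Spatial, per-object, masked cross-attention.

    latents:      (B, N, D)    flattened latent tokens for one frame
    prompt_embs:  (B, O, T, D) text tokens for each of O object prompts
    object_masks: (B, O, N)    float {0,1}; 1 where token n belongs to object o
    """
    D = latents.shape[-1]
    q = latents @ w_q                                    # (B, N, D)
    out = torch.zeros_like(latents)
    for o in range(object_masks.shape[1]):
        k = prompt_embs[:, o] @ w_k                      # (B, T, D)
        v = prompt_embs[:, o] @ w_v                      # (B, T, D)
        attn = torch.softmax(q @ k.transpose(1, 2) / D**0.5, dim=-1)
        # Route each object's attended prompt features only into the
        # latent positions covered by that object's mask.
        out = out + object_masks[:, o].unsqueeze(-1) * (attn @ v)
    return out

def masked_st_self_attention(latents, object_masks, w_q, w_k, w_v):
    """Masked spatio-temporal self-attention across all frames.

    latents:      (B, F*N, D)  tokens from F frames, N spatial tokens each
    object_masks: (B, O, F*N)  float {0,1} per-object masks over all tokens
    """
    D = latents.shape[-1]
    q, k, v = latents @ w_q, latents @ w_k, latents @ w_v
    scores = q @ k.transpose(1, 2) / D**0.5              # (B, F*N, F*N)
    # Allow attention only between tokens of the same object (in any frame),
    # one way to read the per-object frame-to-frame consistency objective.
    same_obj = (object_masks.transpose(1, 2) @ object_masks) > 0
    eye = torch.eye(scores.shape[-1], dtype=torch.bool, device=scores.device)
    scores = scores.masked_fill(~(same_obj | eye), float("-inf"))
    return torch.softmax(scores, dim=-1) @ v
```

Note that restricting the attention pattern by object masks is only the masking idea in isolation; in the paper these objectives operate inside a full video generation backbone conditioned on the stage-one mask-based motion trajectories.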