Through-The-Mask: Mask-based Motion Trajectories for Image-to-Video Generation
January 6, 2025
作者: Guy Yariv, Yuval Kirstain, Amit Zohar, Shelly Sheynin, Yaniv Taigman, Yossi Adi, Sagie Benaim, Adam Polyak
cs.AI
Abstract
We consider the task of Image-to-Video (I2V) generation, which involves
transforming static images into realistic video sequences based on a textual
description. While recent advancements produce photorealistic outputs, they
frequently struggle to create videos with accurate and consistent object
motion, especially in multi-object scenarios. To address these limitations, we
propose a two-stage compositional framework that decomposes I2V generation
into: (i) An explicit intermediate representation generation stage, followed by
(ii) A video generation stage that is conditioned on this representation. Our
key innovation is the introduction of a mask-based motion trajectory as an
intermediate representation, that captures both semantic object information and
motion, enabling an expressive but compact representation of motion and
semantics. To incorporate the learned representation in the second stage, we
utilize object-level attention objectives. Specifically, we consider a spatial,
per-object, masked-cross attention objective, integrating object-specific
prompts into corresponding latent space regions and a masked spatio-temporal
self-attention objective, ensuring frame-to-frame consistency for each object.
We evaluate our method on challenging benchmarks with multi-object and
high-motion scenarios and empirically demonstrate that the proposed method
achieves state-of-the-art results in temporal coherence, motion realism, and
text-prompt faithfulness. Additionally, we introduce \benchmark, a new
challenging benchmark for single-object and multi-object I2V generation, and
demonstrate our method's superiority on this benchmark. Project page is
available at https://guyyariv.github.io/TTM/.
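The per-object masked cross-attention objective described in the abstract can be illustrated with a minimal sketch: each object's text-prompt embeddings are attended to only by the latent positions covered by that object's spatial mask. This is a simplified, hypothetical illustration (function name, tensor shapes, and the single-head formulation are assumptions), not the authors' actual implementation.

```python
import numpy as np

def masked_cross_attention(latents, prompt_embs, obj_masks):
    """Per-object masked cross-attention (single-head sketch).

    latents:     (N, d)    flattened spatial latent tokens
    prompt_embs: (K, L, d) text-token embeddings for K objects
    obj_masks:   (K, N)    binary spatial mask per object

    Returns an (N, d) update in which each latent position aggregates
    text features only from objects whose mask covers that position.
    """
    N, d = latents.shape
    out = np.zeros_like(latents)
    for k in range(obj_masks.shape[0]):
        # scaled dot-product scores between every latent token
        # and object k's prompt tokens
        scores = latents @ prompt_embs[k].T / np.sqrt(d)   # (N, L)
        scores -= scores.max(axis=-1, keepdims=True)       # numerically stable softmax
        attn = np.exp(scores)
        attn /= attn.sum(axis=-1, keepdims=True)
        # restrict the update to object k's spatial region
        out += obj_masks[k][:, None] * (attn @ prompt_embs[k])
    return out
```

Latent positions outside every object mask receive no text-conditioned update, which is what localizes each object-specific prompt to its own region; the paper's masked spatio-temporal self-attention applies the analogous restriction across frames.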