Target-Aware Video Diffusion Models
March 24, 2025
Authors: Taeksoo Kim, Hanbyul Joo
cs.AI
Abstract
We present a target-aware video diffusion model that generates videos from an
input image in which an actor interacts with a specified target while
performing a desired action. The target is defined by a segmentation mask and
the desired action is described via a text prompt. Unlike existing controllable
image-to-video diffusion models that often rely on dense structural or motion
cues to guide the actor's movements toward the target, our target-aware model
requires only a simple mask to indicate the target, leveraging the
generalization capabilities of pretrained models to produce plausible actions.
This makes our method particularly effective for human-object interaction (HOI)
scenarios, where providing precise action guidance is challenging, and further
enables the use of video diffusion models for high-level action planning in
applications such as robotics. We build our target-aware model by extending a
baseline model to incorporate the target mask as an additional input. To
enforce target awareness, we introduce a special token that encodes the
target's spatial information within the text prompt. We then fine-tune the
model with our curated dataset using a novel cross-attention loss that aligns
the cross-attention maps associated with this token with the input target mask.
To further improve performance, we selectively apply this loss to the most
semantically relevant transformer blocks and attention regions. Experimental
results show that our target-aware model outperforms existing solutions in
generating videos where actors interact accurately with the specified targets.
We further demonstrate its efficacy in two downstream applications: video
content creation and zero-shot 3D HOI motion synthesis.
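
To make the core training signal concrete, below is a minimal PyTorch-style sketch of a cross-attention alignment loss of the kind the abstract describes. Everything here is an illustrative assumption rather than the paper's implementation: the function name `target_cross_attention_loss`, the tensor layout, the min-max normalization, and the MSE objective are all stand-ins, since the abstract does not specify the exact formulation.

```python
import torch
import torch.nn.functional as F

def target_cross_attention_loss(attn_maps: torch.Tensor,
                                target_mask: torch.Tensor,
                                token_index: int) -> torch.Tensor:
    """Align the special target token's cross-attention map with the target mask.

    attn_maps:   (B, heads, H*W, T) cross-attention weights from one transformer
                 block, where queries are spatial latents and keys are text tokens.
    target_mask: (B, 1, H_img, W_img) binary segmentation mask of the target.
    token_index: position of the special target token in the text prompt.
    """
    B, _, hw, _ = attn_maps.shape
    side = int(hw ** 0.5)  # assumes a square latent grid

    # Attention paid to the target token at each spatial location,
    # averaged over heads: (B, H*W) -> (B, 1, side, side).
    token_attn = attn_maps[..., token_index].mean(dim=1).view(B, 1, side, side)

    # Min-max normalize per sample so the map is comparable to a binary mask.
    lo = token_attn.amin(dim=(1, 2, 3), keepdim=True)
    hi = token_attn.amax(dim=(1, 2, 3), keepdim=True)
    token_attn = (token_attn - lo) / (hi - lo + 1e-6)

    # Downsample the ground-truth mask to the latent resolution.
    mask = F.interpolate(target_mask.float(), size=(side, side), mode="nearest")

    # Illustrative alignment objective (MSE); the paper's exact loss may differ.
    return F.mse_loss(token_attn, mask)
```

During fine-tuning, a term like this would be added to the standard diffusion denoising loss and, per the abstract, applied selectively to the most semantically relevant transformer blocks and attention regions rather than to every layer.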