Towards Physically Plausible Video Generation via VLM Planning

Abstract

Video diffusion models (VDMs) have advanced significantly in recent years, enabling the generation of highly realistic videos and drawing attention for their potential as world simulators. However, despite these capabilities, VDMs often fail to produce physically plausible videos because they inherently lack an understanding of physics, which results in incorrect dynamics and event sequences. To address this limitation, we propose a novel two-stage image-to-video generation framework that explicitly incorporates physics. In the first stage, we employ a Vision Language Model (VLM) as a coarse-grained motion planner, integrating chain-of-thought and physics-aware reasoning to predict rough motion trajectories/changes that approximate real-world physical dynamics while ensuring inter-frame consistency. In the second stage, we use the predicted motion trajectories/changes to guide the video generation of a VDM. Because the predicted motion trajectories/changes are rough, noise is added during inference to give the VDM freedom to generate finer motion details. Extensive experimental results demonstrate that our framework can produce physically plausible motion, and comparative evaluations highlight the notable superiority of our approach over existing methods. More video results are available on our project page: https://madaoer.github.io/projects/physically_plausible_video_generation.
