Towards Physically Plausible Video Generation via VLM Planning
March 30, 2025
Authors: Xindi Yang, Baolu Li, Yiming Zhang, Zhenfei Yin, Lei Bai, Liqian Ma, Zhiyong Wang, Jianfei Cai, Tien-Tsin Wong, Huchuan Lu, Xu Jia
cs.AI
Abstract
Video diffusion models (VDMs) have advanced significantly in recent years,
enabling the generation of highly realistic videos and drawing the community's
attention to their potential as world simulators. However, despite their
capabilities, VDMs often fail to produce physically plausible videos due to an
inherent lack of understanding of physics, resulting in incorrect dynamics and
event sequences. To address this limitation, we propose a novel two-stage
image-to-video generation framework that explicitly incorporates physics. In
the first stage, we employ a Vision Language Model (VLM) as a coarse-grained
motion planner, integrating chain-of-thought and physics-aware reasoning to
predict rough motion trajectories/changes that approximate real-world
physical dynamics while ensuring inter-frame consistency. In the second
stage, we use the predicted motion trajectories/changes to guide the video
generation of a VDM. As the predicted motion trajectories/changes are rough,
noise is added during inference to give the VDM freedom to generate
motion with finer details. Extensive experimental results demonstrate that
our framework can produce physically plausible motion, and comparative
evaluations highlight the notable superiority of our approach over existing
methods. More video results are available on our Project Page:
https://madaoer.github.io/projects/physically_plausible_video_generation.