How Far is Video Generation from World Model: A Physical Law Perspective

Abstract

OpenAI's Sora highlights the potential of video generation for developing world models that adhere to fundamental physical laws. However, whether video generation models can discover such laws purely from visual data, without human priors, is open to question. A world model that has learned the true law should give predictions that are robust to nuances and extrapolate correctly to unseen scenarios. In this work, we evaluate across three key scenarios: in-distribution, out-of-distribution, and combinatorial generalization. We developed a 2D simulation testbed for object movement and collisions that generates videos deterministically governed by one or more classical mechanics laws. This provides an unlimited supply of data for large-scale experimentation and enables quantitative evaluation of whether the generated videos adhere to physical laws. We trained diffusion-based video generation models to predict object movements based on initial frames. Our scaling experiments show perfect generalization within the distribution, measurable scaling behavior for combinatorial generalization, but failure in out-of-distribution scenarios. Further experiments reveal two key insights about the generalization mechanisms of these models: (1) the models fail to abstract general physical rules and instead exhibit "case-based" generalization behavior, i.e., mimicking the closest training example; (2) when generalizing to new cases, the models prioritize different factors when referencing training data: color > size > velocity > shape. Our study suggests that scaling alone is insufficient for video generation models to uncover fundamental physical laws, despite its role in Sora's broader success. See our project page at https://phyworld.github.io
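
To make the idea of a deterministic testbed with quantitative evaluation more concrete, the following is a minimal sketch and not the authors' implementation. It simplifies to one dimension, and every name in it (uniform_motion, elastic_collision, velocity_error, the frame interval dt) is a hypothetical choice made for illustration. It generates a trajectory governed by a known law and scores how far a candidate trajectory deviates from that law.

```python
# Illustrative sketch only: a 1D stand-in for a deterministic physics testbed.
# All function names and the frame interval `dt` are assumptions, not taken
# from the paper.

def uniform_motion(x0, v, n_frames, dt=0.1):
    """Positions of an object moving at constant velocity v, one per frame."""
    return [x0 + v * dt * i for i in range(n_frames)]


def elastic_collision(m1, v1, m2, v2):
    """Post-collision velocities for a 1D perfectly elastic collision.

    Derived from conservation of momentum and kinetic energy.
    """
    total = m1 + m2
    u1 = ((m1 - m2) * v1 + 2 * m2 * v2) / total
    u2 = ((m2 - m1) * v2 + 2 * m1 * v1) / total
    return u1, u2


def velocity_error(positions, true_v, dt=0.1):
    """Mean absolute gap between finite-difference velocity and the true one."""
    estimated = [(b - a) / dt for a, b in zip(positions, positions[1:])]
    return sum(abs(e - true_v) for e in estimated) / len(estimated)


if __name__ == "__main__":
    # A trajectory that follows the law scores an error close to zero.
    faithful = uniform_motion(x0=0.0, v=1.5, n_frames=10)
    print(velocity_error(faithful, true_v=1.5))

    # A trajectory that copies a slower "nearest training example" does not.
    copied = uniform_motion(x0=0.0, v=1.0, n_frames=10)
    print(velocity_error(copied, true_v=1.5))

    # Equal masses exchange velocities in an elastic collision.
    print(elastic_collision(1.0, 2.0, 1.0, 0.0))
```

Because the ground-truth law is known exactly, any generated trajectory can be scored against it. A metric of this kind would separate a model that extrapolates the law from one that reproduces a nearby training case.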
