How Far is Video Generation from World Model: A Physical Law Perspective
November 4, 2024
Authors: Bingyi Kang, Yang Yue, Rui Lu, Zhijie Lin, Yang Zhao, Kaixin Wang, Gao Huang, Jiashi Feng
cs.AI
Abstract
OpenAI's Sora highlights the potential of video generation for developing
world models that adhere to fundamental physical laws. However, the ability of
video generation models to discover such laws purely from visual data without
human priors can be questioned. A world model learning the true law should give
predictions robust to nuances and correctly extrapolate on unseen scenarios. In
this work, we evaluate across three key scenarios: in-distribution,
out-of-distribution, and combinatorial generalization. We developed a 2D
simulation testbed for object movement and collisions to generate videos
deterministically governed by one or more classical mechanics laws. This
provides an unlimited supply of data for large-scale experimentation and
enables quantitative evaluation of whether the generated videos adhere to
physical laws. We trained diffusion-based video generation models to predict
object movements based on initial frames. Our scaling experiments show perfect
generalization within the distribution, measurable scaling behavior for
combinatorial generalization, but failure in out-of-distribution scenarios.
Further experiments reveal two key insights about the generalization mechanisms
of these models: (1) the models fail to abstract general physical rules and
instead exhibit "case-based" generalization behavior, i.e., mimicking the
closest training example; (2) when generalizing to new cases, models are
observed to prioritize different factors when referencing training data: color
> size > velocity > shape. Our study suggests that scaling alone is
insufficient for video generation models to uncover fundamental physical laws,
despite its role in Sora's broader success. See our project page at
https://phyworld.github.io
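
To make the testbed idea concrete, the following is a minimal sketch (not the authors' code) of the kind of deterministic ground truth a 2D physics simulator can provide, reduced here to one dimension: uniform linear motion and a perfectly elastic collision, plus a simple score for how far a predicted trajectory departs from the law. The function names, the frame interval `dt`, and the use of mean absolute velocity error as the metric are all assumptions made for illustration. The "mimic" trajectory is a hypothetical stand-in for a case-based prediction that replays a training velocity instead of the true one.

```python
def elastic_collision_1d(m1, v1, m2, v2):
    """Post-collision velocities for a perfectly elastic 1D collision.

    Derived from conservation of momentum and kinetic energy.
    """
    total = m1 + m2
    u1 = ((m1 - m2) * v1 + 2.0 * m2 * v2) / total
    u2 = ((m2 - m1) * v2 + 2.0 * m1 * v1) / total
    return u1, u2


def uniform_motion(x0, v, n_frames, dt=0.1):
    """Positions of a ball moving at constant velocity, one entry per frame."""
    return [x0 + v * dt * t for t in range(n_frames)]


def velocity_error(positions, true_velocity, dt=0.1):
    """Mean absolute gap between finite-difference velocity and the true one.

    Assumed metric for illustration; positions would normally be parsed
    from generated video frames.
    """
    steps = zip(positions, positions[1:])
    errors = [abs((b - a) / dt - true_velocity) for a, b in steps]
    return sum(errors) / len(errors)


# Ground truth: a ball starting at x = 0 with velocity 2.0.
truth = uniform_motion(0.0, 2.0, n_frames=5)

# Hypothetical case-based prediction: the model replays a velocity of 1.5
# seen in training rather than extrapolating to the unseen value 2.0.
mimic = uniform_motion(0.0, 1.5, n_frames=5)

print("error of law-following trajectory:", velocity_error(truth, 2.0))
print("error of mimicking trajectory:", velocity_error(mimic, 2.0))
print("equal-mass collision:", elastic_collision_1d(1.0, 2.0, 1.0, -1.0))
```

Because the simulator is deterministic, any gap between the generated motion and the closed-form result can be attributed to the model rather than to noise in the data, which is what makes the quantitative evaluation described in the abstract possible.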