How Far is Video Generation from World Model: A Physical Law Perspective

Abstract

OpenAI's Sora highlights the potential of video generation for developing world models that adhere to fundamental physical laws. However, whether video generation models can discover such laws purely from visual data, without human priors, remains an open question. A world model that has learned the true law should give predictions that are robust to nuances and should extrapolate correctly to unseen scenarios. In this work, we evaluate generalization across three key scenarios: in-distribution, out-of-distribution, and combinatorial generalization. We developed a 2D simulation testbed for object movement and collisions to generate videos deterministically governed by one or more classical mechanics laws. This provides an unlimited supply of data for large-scale experimentation and enables quantitative evaluation of whether the generated videos adhere to physical laws. We trained diffusion-based video generation models to predict object movements based on initial frames. Our scaling experiments show perfect generalization within the distribution, measurable scaling behavior for combinatorial generalization, but failure in out-of-distribution scenarios. Further experiments reveal two key insights about the generalization mechanisms of these models: (1) the models fail to abstract general physical rules and instead exhibit "case-based" generalization behavior, i.e., mimicking the closest training example; (2) when generalizing to new cases, the models are observed to prioritize different factors when referencing training data: color > size > velocity > shape. Our study suggests that scaling alone is insufficient for video generation models to uncover fundamental physical laws, despite its role in Sora's broader success. See our project page at https://phyworld.github.io.
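To make the idea of a deterministic testbed concrete, the following is a minimal sketch rather than the authors' simulator. It assumes a simplified 1D setting with two balls in uniform motion and a perfectly elastic head-on collision. The names `Ball`, `elastic_collision`, and `simulate` are hypothetical and chosen for illustration only. Because the trajectory follows in closed form from the initial state, any predicted trajectory can be compared against it numerically.

```python
from dataclasses import dataclass


@dataclass
class Ball:
    x: float  # center position
    v: float  # velocity
    m: float  # mass
    r: float  # radius


def elastic_collision(m1, v1, m2, v2):
    """Post-collision velocities for a 1D perfectly elastic collision.

    Derived from conservation of momentum and kinetic energy.
    """
    u1 = ((m1 - m2) * v1 + 2 * m2 * v2) / (m1 + m2)
    u2 = ((m2 - m1) * v2 + 2 * m1 * v1) / (m1 + m2)
    return u1, u2


def simulate(a, b, dt, steps):
    """Advance two balls (a starts left of b) and return per-frame positions.

    Illustrative assumption: fixed time step, no friction, no walls.
    """
    frames = []
    for _ in range(steps):
        frames.append((a.x, b.x))
        a.x += a.v * dt
        b.x += b.v * dt
        # Resolve a collision only when the balls touch and are approaching.
        if b.x - a.x <= a.r + b.r and a.v > b.v:
            a.v, b.v = elastic_collision(a.m, a.v, b.m, b.v)
    return frames


if __name__ == "__main__":
    left = Ball(x=0.0, v=1.0, m=1.0, r=0.5)
    right = Ball(x=3.0, v=0.0, m=1.0, r=0.5)
    trajectory = simulate(left, right, dt=0.1, steps=50)
    print(len(trajectory), left.v, right.v)
```

With equal masses the two balls exchange velocities at impact, which gives a simple check that a generated video respects the underlying law instead of merely looking plausible.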
