WorldSimBench: Towards Video Generation Models as World Simulators

Abstract

Recent advancements in predictive models have demonstrated exceptional capabilities in predicting the future state of objects and scenes. However, the lack of categorization based on inherent characteristics continues to hinder the progress of predictive model development. Additionally, existing benchmarks cannot effectively evaluate higher-capability, highly embodied predictive models from an embodied perspective. In this work, we classify the functionalities of predictive models into a hierarchy and take the first step in evaluating World Simulators by proposing a dual evaluation framework called WorldSimBench. WorldSimBench includes Explicit Perceptual Evaluation and Implicit Manipulative Evaluation, encompassing human preference assessments from the visual perspective and action-level evaluations in embodied tasks, and covers three representative embodied scenarios: Open-Ended Embodied Environment, Autonomous Driving, and Robot Manipulation. In the Explicit Perceptual Evaluation, we introduce the HF-Embodied Dataset, a video assessment dataset based on fine-grained human feedback, which we use to train a Human Preference Evaluator that aligns with human perception and explicitly assesses the visual fidelity of World Simulators. In the Implicit Manipulative Evaluation, we assess the video-action consistency of World Simulators by evaluating whether the generated situation-aware video can be accurately translated into the correct control signals in dynamic environments. Our comprehensive evaluation offers key insights that can drive further innovation in video generation models, positioning World Simulators as a pivotal advancement toward embodied artificial intelligence.
