
PhysGame: Uncovering Physical Commonsense Violations in Gameplay Videos

December 2, 2024
作者: Meng Cao, Haoran Tang, Haoze Zhao, Hangyu Guo, Jiaheng Liu, Ge Zhang, Ruyang Liu, Qiang Sun, Ian Reid, Xiaodan Liang
cs.AI

Abstract

Recent advancements in video-based large language models (Video LLMs) have witnessed the emergence of diverse capabilities to reason about and interpret dynamic visual content. Among them, gameplay videos stand out as a distinctive data source, often containing glitches that defy physics commonsense. This characteristic renders them an effective benchmark for assessing the under-explored capability of physical commonsense understanding in video LLMs. In this paper, we propose PhysGame as a pioneering benchmark to evaluate physical commonsense violations in gameplay videos. PhysGame comprises 880 videos associated with glitches spanning four fundamental domains (i.e., mechanics, kinematics, optics, and material properties) and 12 distinct types of physical commonsense. Through extensively evaluating various state-of-the-art video LLMs, our findings reveal that the performance of current open-source video LLMs significantly lags behind that of proprietary counterparts. To bridge this gap, we curate an instruction tuning dataset, PhysInstruct, with 140,057 question-answering pairs to facilitate physical commonsense learning. In addition, we also propose a preference optimization dataset, PhysDPO, with 34,358 training pairs, where the dis-preferred responses are generated conditioned on misleading titles (i.e., meta information hacking), fewer frames (i.e., temporal hacking), and lower spatial resolutions (i.e., spatial hacking). Based on this suite of datasets, we propose PhysVLM as a physical knowledge-enhanced video LLM. Extensive experiments on both the physical-oriented benchmark PhysGame and general video understanding benchmarks demonstrate the state-of-the-art performance of PhysVLM.
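The PhysDPO construction described above can be illustrated with a minimal sketch. This is an assumption-laden illustration, not the authors' code: the function names (`temporal_hack`, `spatial_hack`, `build_dpo_pair`) and the dictionary layout are hypothetical, and real dis-preferred responses would come from re-prompting a model under the degraded conditions rather than being passed in directly.

```python
# Hypothetical sketch of assembling a PhysDPO-style preference pair.
# The dis-preferred ("rejected") answer is produced under degraded or
# misleading conditions: a misleading title (meta information hacking),
# fewer frames (temporal hacking), and lower resolution (spatial hacking).

def temporal_hack(frames, keep=4):
    """Keep only `keep` evenly spaced frames from the video."""
    if len(frames) <= keep:
        return list(frames)
    step = (len(frames) - 1) / (keep - 1)
    return [frames[round(i * step)] for i in range(keep)]

def spatial_hack(resolution, factor=4):
    """Downscale the (width, height) resolution by an integer factor."""
    width, height = resolution
    return (width // factor, height // factor)

def build_dpo_pair(video, question, preferred, dispreferred,
                   misleading_title=None):
    """Pair a preferred answer with a dis-preferred one, recording the
    degraded context the dis-preferred answer was conditioned on."""
    return {
        "question": question,
        "chosen": preferred,
        "rejected": dispreferred,
        "rejected_context": {
            "title": misleading_title or video["title"],      # meta hacking
            "frames": temporal_hack(video["frames"]),          # temporal hacking
            "resolution": spatial_hack(video["resolution"]),   # spatial hacking
        },
    }
```

For example, a 16-frame 1280x720 clip would be reduced to 4 frames at 320x180 for the rejected branch, so that answers grounded in the full video are preferred over answers produced from impoverished input.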
