PhysGame: Uncovering Physical Commonsense Violations in Gameplay Videos
December 2, 2024
Authors: Meng Cao, Haoran Tang, Haoze Zhao, Hangyu Guo, Jiaheng Liu, Ge Zhang, Ruyang Liu, Qiang Sun, Ian Reid, Xiaodan Liang
cs.AI
Abstract
Recent advancements in video-based large language models (Video LLMs) have
witnessed the emergence of diverse capabilities to reason and interpret dynamic
visual content. Among them, gameplay videos stand out as a distinctive data
source, often containing glitches that defy physics commonsense. This
characteristic renders them an effective benchmark for assessing the
under-explored capability of physical commonsense understanding in video LLMs.
In this paper, we propose PhysGame as a pioneering benchmark to evaluate
physical commonsense violations in gameplay videos. PhysGame comprises 880
videos associated with glitches spanning four fundamental domains (i.e.,
mechanics, kinematics, optics, and material properties) and 12 distinct
physical commonsense categories. Through an extensive evaluation of various state-of-the-art
video LLMs, our findings reveal that the performance of current open-source
video LLMs significantly lags behind that of proprietary counterparts. To
bridge this gap, we curate an instruction tuning dataset PhysInstruct with
140,057 question-answering pairs to facilitate physical commonsense learning.
In addition, we propose a preference optimization dataset PhysDPO with
34,358 training pairs, where the dis-preferred responses are generated
conditioned on misleading titles (i.e., meta information hacking), fewer frames
(i.e., temporal hacking) and lower spatial resolutions (i.e., spatial hacking).
Based on the suite of datasets, we propose PhysVLM as a physical
knowledge-enhanced video LLM. Extensive experiments on both physical-oriented
benchmark PhysGame and general video understanding benchmarks demonstrate the
state-ofthe-art performance of PhysVLM.Summary
AI-Generated Summary
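The PhysDPO construction described above can be pictured as follows: for each question, the preferred response is generated from the full-quality video, while the dis-preferred response is generated from deliberately degraded inputs. The sketch below is a minimal, hypothetical illustration of the three degradations (meta-information, temporal, and spatial hacking) and of assembling a preference pair in the common prompt/chosen/rejected format; the function names and default parameters are assumptions, not the paper's actual implementation.

```python
def temporal_hack(frames, keep_every=4):
    """Temporal hacking: subsample the video by keeping every k-th frame."""
    return frames[::keep_every]


def spatial_hack(frame, factor=4):
    """Spatial hacking: naive nearest-neighbour downscale of one frame,
    where a frame is represented as a list of pixel rows."""
    return [row[::factor] for row in frame[::factor]]


def meta_hack(title, misleading_title):
    """Meta-information hacking: replace the real title with a misleading one
    before conditioning the model's response on it."""
    return misleading_title


def build_dpo_pair(question, preferred_answer, degraded_answer):
    """Assemble one preference-optimization training pair in the usual
    prompt/chosen/rejected layout."""
    return {
        "prompt": question,
        "chosen": preferred_answer,     # generated from full-quality input
        "rejected": degraded_answer,    # generated from hacked input
    }
```

In this scheme, the rejected response inherits the model's errors under impoverished or misleading conditioning, which gives the preference optimizer a signal to prefer answers grounded in the full visual evidence.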