Thinking in Space: How Multimodal Large Language Models See, Remember, and Recall Spaces
December 18, 2024
Authors: Jihan Yang, Shusheng Yang, Anjali W. Gupta, Rilyn Han, Li Fei-Fei, Saining Xie
cs.AI
Abstract
Humans possess the visual-spatial intelligence to remember spaces from
sequential visual observations. However, can Multimodal Large Language Models
(MLLMs) trained on million-scale video datasets also "think in space" from
videos? We present a novel video-based visual-spatial intelligence benchmark
(VSI-Bench) of over 5,000 question-answer pairs, and find that MLLMs exhibit
competitive, though subhuman, visual-spatial intelligence. We probe models to
express how they think in space both linguistically and visually and find that
while spatial reasoning capabilities remain the primary bottleneck for MLLMs to
reach higher benchmark performance, local world models and spatial awareness do
emerge within these models. Notably, prevailing linguistic reasoning techniques
(e.g., chain-of-thought, self-consistency, tree-of-thoughts) fail to improve
performance, whereas explicitly generating cognitive maps during
question-answering enhances MLLMs' spatial distance ability.