Thinking in Space: How Multimodal Large Language Models See, Remember, and Recall Spaces

December 18, 2024
Authors: Jihan Yang, Shusheng Yang, Anjali W. Gupta, Rilyn Han, Li Fei-Fei, Saining Xie
cs.AI

Abstract

Humans possess the visual-spatial intelligence to remember spaces from sequential visual observations. However, can Multimodal Large Language Models (MLLMs) trained on million-scale video datasets also "think in space" from videos? We present a novel video-based visual-spatial intelligence benchmark (VSI-Bench) of over 5,000 question-answer pairs, and find that MLLMs exhibit competitive - though subhuman - visual-spatial intelligence. We probe models to express how they think in space both linguistically and visually and find that while spatial reasoning capabilities remain the primary bottleneck for MLLMs to reach higher benchmark performance, local world models and spatial awareness do emerge within these models. Notably, prevailing linguistic reasoning techniques (e.g., chain-of-thought, self-consistency, tree-of-thoughts) fail to improve performance, whereas explicitly generating cognitive maps during question-answering enhances MLLMs' spatial distance ability.
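
The abstract does not spell out how cognitive maps are elicited during question-answering. The sketch below is a minimal illustration of one plausible two-step prompting scheme, in which the model first externalizes a coarse object-position map and then answers the distance question conditioned on it; the query_mllm helper, the prompt wording, and the 10x10 grid are illustrative assumptions, not the authors' implementation.

# Hypothetical sketch of cognitive-map prompting for a spatial-distance
# question. The abstract only states that explicitly generating cognitive
# maps during question-answering helps; everything concrete below is an
# illustrative assumption.

from typing import List


def query_mllm(video_frames: List[bytes], prompt: str) -> str:
    """Placeholder for a call to some multimodal LLM API (assumption)."""
    raise NotImplementedError


def answer_with_cognitive_map(video_frames: List[bytes], question: str) -> str:
    # Step 1: ask the model to externalize its local world model as a coarse map.
    map_prompt = (
        "Watch the video and build a cognitive map of the scene: "
        "list each major object with approximate (x, y) coordinates "
        "on a 10x10 grid, one object per line."
    )
    cognitive_map = query_mllm(video_frames, map_prompt)

    # Step 2: answer the spatial question conditioned on the generated map.
    qa_prompt = (
        "Here is a cognitive map of the scene:\n"
        f"{cognitive_map}\n\n"
        f"Using this map and the video, answer: {question}"
    )
    return query_mllm(video_frames, qa_prompt)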
