Thinking in Space: How Multimodal Large Language Models See, Remember, and Recall Spaces
December 18, 2024
Authors: Jihan Yang, Shusheng Yang, Anjali W. Gupta, Rilyn Han, Li Fei-Fei, Saining Xie
cs.AI
Abstract
Humans possess the visual-spatial intelligence to remember spaces from
sequential visual observations. However, can Multimodal Large Language Models
(MLLMs) trained on million-scale video datasets also "think in space" from
videos? We present a novel video-based visual-spatial intelligence benchmark
(VSI-Bench) of over 5,000 question-answer pairs, and find that MLLMs exhibit
competitive - though subhuman - visual-spatial intelligence. We probe models to
express how they think in space both linguistically and visually and find that
while spatial reasoning capabilities remain the primary bottleneck for MLLMs to
reach higher benchmark performance, local world models and spatial awareness do
emerge within these models. Notably, prevailing linguistic reasoning techniques
(e.g., chain-of-thought, self-consistency, tree-of-thoughts) fail to improve
performance, whereas explicitly generating cognitive maps during
question-answering enhances MLLMs' spatial distance ability.
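The abstract's main prompting finding is that explicitly generating a cognitive map before answering improves performance on spatial distance questions. The snippet below is a minimal, hypothetical sketch of such two-step prompting against an OpenAI-compatible chat API; the model id, the 10x10 grid, and the prompt wording are illustrative assumptions, not the authors' released implementation.

```python
# Hypothetical sketch of "explicit cognitive map" prompting: the model first lays
# the objects it saw onto a coarse top-down grid, then answers the spatial
# question conditioned on that map. Model id and prompts are placeholders.
import base64
from openai import OpenAI  # assumes an OpenAI-compatible endpoint for a video-capable MLLM

client = OpenAI()
MODEL = "gpt-4o"  # placeholder model id


def frame_to_part(path: str) -> dict:
    """Encode one sampled video frame as an image message part."""
    with open(path, "rb") as f:
        b64 = base64.b64encode(f.read()).decode()
    return {"type": "image_url", "image_url": {"url": f"data:image/jpeg;base64,{b64}"}}


def answer_with_cognitive_map(frame_paths: list[str], question: str) -> str:
    frames = [frame_to_part(p) for p in frame_paths]

    # Step 1: ask the model to externalize its spatial memory as a coarse map.
    map_prompt = (
        "From these video frames, place every visible object on a 10x10 top-down "
        "grid of the room. Output one line per object as 'object: (row, col)'."
    )
    cog_map = client.chat.completions.create(
        model=MODEL,
        messages=[{"role": "user", "content": frames + [{"type": "text", "text": map_prompt}]}],
    ).choices[0].message.content

    # Step 2: answer the spatial question conditioned on the generated map.
    qa_prompt = (
        f"Here is a top-down cognitive map of the scene:\n{cog_map}\n\n"
        f"Using this map and the frames, answer the question: {question}"
    )
    return client.chat.completions.create(
        model=MODEL,
        messages=[{"role": "user", "content": frames + [{"type": "text", "text": qa_prompt}]}],
    ).choices[0].message.content
```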