Thinking in Space: How Multimodal Large Language Models See, Remember, and Recall Spaces
December 18, 2024
Authors: Jihan Yang, Shusheng Yang, Anjali W. Gupta, Rilyn Han, Li Fei-Fei, Saining Xie
cs.AI
Abstract
Humans possess the visual-spatial intelligence to remember spaces from
sequential visual observations. However, can Multimodal Large Language Models
(MLLMs) trained on million-scale video datasets also "think in space" from
videos? We present a novel video-based visual-spatial intelligence benchmark
(VSI-Bench) of over 5,000 question-answer pairs, and find that MLLMs exhibit
competitive - though subhuman - visual-spatial intelligence. We probe models to
express how they think in space both linguistically and visually and find that
while spatial reasoning capabilities remain the primary bottleneck for MLLMs to
reach higher benchmark performance, local world models and spatial awareness do
emerge within these models. Notably, prevailing linguistic reasoning techniques
(e.g., chain-of-thought, self-consistency, tree-of-thoughts) fail to improve
performance, whereas explicitly generating cognitive maps during
question-answering enhances MLLMs' spatial distance ability.
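The abstract's main prompting finding is that explicitly generating a cognitive map before answering improves performance on spatial distance questions. The snippet below is a minimal, hypothetical sketch of such two-step prompting against an OpenAI-compatible chat API; the model id, the 10x10 grid, and the prompt wording are illustrative assumptions, not the authors' released implementation.

```python
# Hypothetical sketch of "explicit cognitive map" prompting: the model first lays
# the objects it saw onto a coarse top-down grid, then answers the spatial
# question conditioned on that map. Model id and prompts are placeholders.
import base64
from openai import OpenAI  # assumes an OpenAI-compatible endpoint for a video-capable MLLM

client = OpenAI()
MODEL = "gpt-4o"  # placeholder model id


def frame_to_part(path: str) -> dict:
    """Encode one sampled video frame as an image message part."""
    with open(path, "rb") as f:
        b64 = base64.b64encode(f.read()).decode()
    return {"type": "image_url", "image_url": {"url": f"data:image/jpeg;base64,{b64}"}}


def answer_with_cognitive_map(frame_paths: list[str], question: str) -> str:
    frames = [frame_to_part(p) for p in frame_paths]

    # Step 1: ask the model to externalize its spatial memory as a coarse map.
    map_prompt = (
        "From these video frames, place every visible object on a 10x10 top-down "
        "grid of the room. Output one line per object as 'object: (row, col)'."
    )
    cog_map = client.chat.completions.create(
        model=MODEL,
        messages=[{"role": "user", "content": frames + [{"type": "text", "text": map_prompt}]}],
    ).choices[0].message.content

    # Step 2: answer the spatial question conditioned on the generated map.
    qa_prompt = (
        f"Here is a top-down cognitive map of the scene:\n{cog_map}\n\n"
        f"Using this map and the frames, answer the question: {question}"
    )
    return client.chat.completions.create(
        model=MODEL,
        messages=[{"role": "user", "content": frames + [{"type": "text", "text": qa_prompt}]}],
    ).choices[0].message.content
```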