Thinking in Space: How Multimodal Large Language Models See, Remember, and Recall Spaces

Abstract

Humans possess the visual-spatial intelligence to remember spaces from sequential visual observations. However, can Multimodal Large Language Models (MLLMs) trained on million-scale video datasets also "think in space" from videos? We present a novel video-based visual-spatial intelligence benchmark (VSI-Bench) of over 5,000 question-answer pairs, and find that MLLMs exhibit competitive, though subhuman, visual-spatial intelligence. We probe models to express how they think in space both linguistically and visually, and find that while spatial reasoning capabilities remain the primary bottleneck for MLLMs to reach higher benchmark performance, local world models and spatial awareness do emerge within these models. Notably, prevailing linguistic reasoning techniques (e.g., chain-of-thought, self-consistency, tree-of-thoughts) fail to improve performance, whereas explicitly generating cognitive maps during question-answering enhances MLLMs' spatial distance ability.
