MapEval: A Map-Based Evaluation of Geo-Spatial Reasoning in Foundation Models

Abstract

Recent advancements in foundation models have enhanced AI systems' capabilities in autonomous tool usage and reasoning. However, their ability to perform location- or map-based reasoning, which improves daily life by optimizing navigation, facilitating resource discovery, and streamlining logistics, has not been systematically studied. To bridge this gap, we introduce MapEval, a benchmark designed to assess diverse and complex map-based user queries with geo-spatial reasoning. MapEval features three task types (textual, API-based, and visual) that require collecting world information via map tools, processing heterogeneous geo-spatial contexts (e.g., named entities, travel distances, user reviews or ratings, images), and performing compositional reasoning, all of which state-of-the-art foundation models find challenging. Comprising 700 unique multiple-choice questions about locations across 180 cities and 54 countries, MapEval evaluates foundation models' ability to handle spatial relationships, map infographics, travel planning, and navigation challenges. Using MapEval, we conducted a comprehensive evaluation of 28 prominent foundation models. While no single model excelled across all tasks, Claude-3.5-Sonnet, GPT-4o, and Gemini-1.5-Pro achieved competitive performance overall. However, substantial performance gaps emerged, particularly in MapEval, where agents with Claude-3.5-Sonnet outperformed GPT-4o and Gemini-1.5-Pro by 16% and 21%, respectively, and the gaps widened further when compared to open-source LLMs. Our detailed analyses provide insights into the strengths and weaknesses of current models, though all models still fall short of human performance by more than 20% on average, struggling with complex map images and rigorous geo-spatial reasoning. This gap highlights MapEval's critical role in advancing general-purpose foundation models with stronger geo-spatial understanding.
