MapEval: A Map-Based Evaluation of Geo-Spatial Reasoning in Foundation Models
December 31, 2024
Authors: Mahir Labib Dihan, Md Tanvir Hassan, Md Tanvir Parvez, Md Hasebul Hasan, Md Almash Alam, Muhammad Aamir Cheema, Mohammed Eunus Ali, Md Rizwan Parvez
cs.AI
Abstract
Recent advancements in foundation models have enhanced AI systems'
capabilities in autonomous tool usage and reasoning. However, their ability in
location or map-based reasoning - which improves daily life by optimizing
navigation, facilitating resource discovery, and streamlining logistics - has
not been systematically studied. To bridge this gap, we introduce MapEval, a
benchmark designed to assess diverse and complex map-based user queries with
geo-spatial reasoning. MapEval features three task types (textual, API-based,
and visual) that require collecting world information via map tools, processing
heterogeneous geo-spatial contexts (e.g., named entities, travel distances,
user reviews or ratings, images), and compositional reasoning, which all
state-of-the-art foundation models find challenging. Comprising 700 unique
multiple-choice questions about locations across 180 cities and 54 countries,
MapEval evaluates foundation models' ability to handle spatial relationships,
map infographics, travel planning, and navigation challenges. Using MapEval, we
conducted a comprehensive evaluation of 28 prominent foundation models. While
no single model excelled across all tasks, Claude-3.5-Sonnet, GPT-4o, and
Gemini-1.5-Pro achieved competitive performance overall. However, substantial
performance gaps emerged, particularly in MapEval, where agents with
Claude-3.5-Sonnet outperformed GPT-4o and Gemini-1.5-Pro by 16% and 21%,
respectively, and the gaps became even more amplified when compared to
open-source LLMs. Our detailed analyses provide insights into the strengths and
weaknesses of current models, though all models still fall short of human
performance by more than 20% on average, struggling with complex map images and
rigorous geo-spatial reasoning. This gap highlights MapEval's critical role in
advancing general-purpose foundation models with stronger geo-spatial
understanding.