MapEval: A Map-Based Evaluation of Geo-Spatial Reasoning in Foundation Models

December 31, 2024
Authors: Mahir Labib Dihan, Md Tanvir Hassan, Md Tanvir Parvez, Md Hasebul Hasan, Md Almash Alam, Muhammad Aamir Cheema, Mohammed Eunus Ali, Md Rizwan Parvez
cs.AI

Abstract

Recent advancements in foundation models have enhanced AI systems' capabilities in autonomous tool usage and reasoning. However, their ability in location or map-based reasoning - which improves daily life by optimizing navigation, facilitating resource discovery, and streamlining logistics - has not been systematically studied. To bridge this gap, we introduce MapEval, a benchmark designed to assess diverse and complex map-based user queries with geo-spatial reasoning. MapEval features three task types (textual, API-based, and visual) that require collecting world information via map tools, processing heterogeneous geo-spatial contexts (e.g., named entities, travel distances, user reviews or ratings, images), and compositional reasoning, which all state-of-the-art foundation models find challenging. Comprising 700 unique multiple-choice questions about locations across 180 cities and 54 countries, MapEval evaluates foundation models' ability to handle spatial relationships, map infographics, travel planning, and navigation challenges. Using MapEval, we conducted a comprehensive evaluation of 28 prominent foundation models. While no single model excelled across all tasks, Claude-3.5-Sonnet, GPT-4o, and Gemini-1.5-Pro achieved competitive performance overall. However, substantial performance gaps emerged, particularly in MapEval, where agents with Claude-3.5-Sonnet outperformed GPT-4o and Gemini-1.5-Pro by 16% and 21%, respectively, and the gaps became even more amplified when compared to open-source LLMs. Our detailed analyses provide insights into the strengths and weaknesses of current models, though all models still fall short of human performance by more than 20% on average, struggling with complex map images and rigorous geo-spatial reasoning. This gap highlights MapEval's critical role in advancing general-purpose foundation models with stronger geo-spatial understanding.
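The abstract describes MapEval as 700 multiple-choice questions scored for accuracy. As a rough illustration of how such a benchmark might be consumed, the sketch below scores a model on toy multiple-choice records; the record fields, the `model_answer` stub, and the sample questions are all invented for illustration and do not reflect MapEval's actual schema or evaluation harness.

```python
# Hypothetical sketch of a multiple-choice evaluation loop in the style of
# MapEval; the schema and model stub are assumptions, not the real benchmark.

def model_answer(question: str, options: list[str]) -> int:
    # Placeholder for a foundation-model call (the textual, API-based, and
    # visual task types would supply different context). This stub always
    # picks the first option.
    return 0

def evaluate(dataset: list[dict]) -> float:
    """Return multiple-choice accuracy over the dataset."""
    correct = 0
    for item in dataset:
        pred = model_answer(item["question"], item["options"])
        if pred == item["answer"]:
            correct += 1
    return correct / len(dataset) if dataset else 0.0

# Toy example with two invented geo-spatial questions.
dataset = [
    {"question": "Which landmark is closest to the central station?",
     "options": ["Museum", "Park", "Harbor", "Stadium"], "answer": 0},
    {"question": "Which route minimizes total travel distance?",
     "options": ["A", "B", "C", "D"], "answer": 2},
]
print(evaluate(dataset))  # 0.5 with the always-pick-first stub
```

Swapping the stub for a real model call (and feeding it map-tool outputs or map images) would reproduce the accuracy-style comparison the paper reports across its 28 models.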
