MapEval: A Map-Based Evaluation of Geo-Spatial Reasoning in Foundation Models

Abstract

Recent advancements in foundation models have improved AI systems' capabilities in autonomous tool use and reasoning. However, their ability in location- or map-based reasoning, which improves daily life by optimizing navigation, facilitating resource discovery, and streamlining logistics, has not been systematically studied. To bridge this gap, we introduce MapEval, a benchmark designed to assess diverse and complex map-based user queries with geo-spatial reasoning. MapEval features three task types (textual, API-based, and visual) that require collecting world information via map tools, processing heterogeneous geo-spatial contexts (e.g., named entities, travel distances, user reviews or ratings, images), and compositional reasoning, all of which state-of-the-art foundation models find challenging. Comprising 700 unique multiple-choice questions about locations across 180 cities and 54 countries, MapEval evaluates foundation models' ability to handle spatial relationships, map infographics, travel planning, and navigation challenges. Using MapEval, we conducted a comprehensive evaluation of 28 prominent foundation models. While no single model excelled across all tasks, Claude-3.5-Sonnet, GPT-4o, and Gemini-1.5-Pro achieved competitive performance overall. However, substantial performance gaps emerged, particularly in MapEval, where agents with Claude-3.5-Sonnet outperformed GPT-4o and Gemini-1.5-Pro by 16% and 21%, respectively, and the gaps widened further when compared to open-source LLMs. Our detailed analyses provide insights into the strengths and weaknesses of current models, though all models still fall short of human performance by more than 20% on average, struggling with complex map images and rigorous geo-spatial reasoning. This gap highlights MapEval's critical role in advancing general-purpose foundation models with stronger geo-spatial understanding.
