NAVIG: Natural Language-guided Analysis with Vision Language Models for Image Geo-localization
February 20, 2025
Authors: Zheyuan Zhang, Runze Li, Tasnim Kabir, Jordan Boyd-Graber
cs.AI
Abstract
Image geo-localization is the task of predicting the specific location of an
image and requires complex reasoning across visual, geographical, and cultural
contexts. While prior Vision Language Models (VLMs) have the best accuracy at
this task, there is a dearth of high-quality datasets and models for analytical
reasoning. We first create NaviClues, a high-quality dataset derived from
GeoGuessr, a popular geography game, to supply examples of expert reasoning
from language. Using this dataset, we present Navig, a comprehensive image
geo-localization framework integrating global and fine-grained image
information. By reasoning with language, Navig reduces the average distance
error by 14% compared to previous state-of-the-art models while requiring fewer
than 1000 training samples. Our dataset and code are available at
https://github.com/SparrowZheyuan18/Navig/.