UniGoal: Towards Universal Zero-shot Goal-oriented Navigation
March 13, 2025
Authors: Hang Yin, Xiuwei Xu, Lingqing Zhao, Ziwei Wang, Jie Zhou, Jiwen Lu
cs.AI
Abstract
In this paper, we propose a general framework for universal zero-shot
goal-oriented navigation. Existing zero-shot methods build inference frameworks
upon large language models (LLMs) for specific tasks; these frameworks differ
substantially in their overall pipelines and fail to generalize across
different types of goals. Toward universal zero-shot navigation, we propose a
uniform graph representation that unifies different goals, including object
categories, instance images, and text descriptions. We also convert the agent's
observations into an online-maintained scene graph. With this consistent
representation of scene and goal, we preserve most structural information
compared with pure text and can leverage an LLM for explicit graph-based
reasoning. Specifically, we conduct graph matching between the scene graph and
the goal graph at each time step and propose different strategies to generate a
long-term exploration goal according to the matching state. When there is zero
matching, the agent iteratively searches for a subgraph of the goal graph. With
partial matching, the agent utilizes coordinate projection and anchor-pair
alignment to infer the goal location. Finally, with perfect matching, scene
graph correction and goal verification are applied. We also present a blacklist
mechanism to enable robust switching between stages. Extensive experiments on
several benchmarks show that our UniGoal achieves state-of-the-art zero-shot
performance on the three studied navigation tasks with a single model, even
outperforming task-specific zero-shot methods and supervised universal methods.
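The staged, matching-driven policy described in the abstract can be sketched as below. This is a minimal illustration, not the authors' implementation: the function and class names (`match_state`, `next_long_term_goal`), the node-label matching, and the returned action tuples are all assumptions; the paper's actual graph matching would compare nodes and edges (spatial relations) of the scene and goal graphs, typically with LLM-based reasoning.

```python
from enum import Enum, auto

class MatchState(Enum):
    ZERO = auto()     # no goal-graph node found in the scene graph
    PARTIAL = auto()  # some, but not all, goal-graph nodes matched
    PERFECT = auto()  # every goal-graph node matched

def match_state(scene_nodes, goal_nodes):
    """Classify the overlap between scene-graph and goal-graph nodes.

    A stand-in for the paper's graph matching: here we only compare
    node labels, whereas the real method also matches edges.
    """
    matched = set(scene_nodes) & set(goal_nodes)
    if not matched:
        return MatchState.ZERO
    if matched == set(goal_nodes):
        return MatchState.PERFECT
    return MatchState.PARTIAL

def next_long_term_goal(scene_nodes, goal_nodes, blacklist):
    """Dispatch to a stage-specific strategy based on the matching state.

    `blacklist` holds goal-graph nodes already ruled out, enabling the
    robust stage switching the abstract mentions.
    """
    state = match_state(scene_nodes, goal_nodes)
    if state is MatchState.ZERO:
        # Stage 1: iteratively search for any subgraph of the goal graph,
        # skipping blacklisted candidates.
        candidates = [n for n in goal_nodes if n not in blacklist]
        return ("explore_for", candidates)
    if state is MatchState.PARTIAL:
        # Stage 2: infer the goal location from matched anchors
        # (coordinate projection + anchor-pair alignment in the paper).
        anchors = sorted(set(scene_nodes) & set(goal_nodes))
        return ("infer_from_anchors", anchors)
    # Stage 3: scene graph correction and goal verification.
    return ("verify_goal", sorted(set(goal_nodes)))
```

The key design point is that a single dispatch over matching states replaces the task-specific pipelines of prior zero-shot methods, since object-category, instance-image, and text-description goals all reduce to the same goal-graph form.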