SAME: Learning Generic Language-Guided Visual Navigation with State-Adaptive Mixture of Experts

Abstract

Research on instruction-guided visual navigation is generally divided into high-level category-specific search and low-level language-guided navigation, depending on the granularity of the language instruction: the former emphasizes the exploration process, while the latter concentrates on following detailed textual commands. Despite the differing focuses of these tasks, the underlying requirements of interpreting instructions, comprehending the surroundings, and inferring action decisions remain consistent. This paper consolidates diverse navigation tasks into a unified and generic framework. We investigate the core difficulties of sharing general knowledge and exploiting task-specific capabilities in learning navigation, and propose a novel State-Adaptive Mixture of Experts (SAME) model that enables an agent to infer decisions based on language of different granularity and on dynamic observations. Powered by SAME, we present a versatile agent that addresses seven navigation tasks simultaneously and either outperforms task-specific agents or achieves highly comparable performance.

