SoFar: Language-Grounded Orientation Bridges Spatial Reasoning and Object Manipulation
February 18, 2025
Authors: Zekun Qi, Wenyao Zhang, Yufei Ding, Runpei Dong, Xinqiang Yu, Jingwen Li, Lingyun Xu, Baoyu Li, Xialin He, Guofan Fan, Jiazhao Zhang, Jiawei He, Jiayuan Gu, Xin Jin, Kaisheng Ma, Zhizheng Zhang, He Wang, Li Yi
cs.AI
Abstract
Spatial intelligence is a critical component of embodied AI, enabling robots
to understand and interact with their environments. While recent advances have
enhanced the ability of vision-language models (VLMs) to perceive object
locations and positional relationships, they still cannot precisely understand
object orientations, a key requirement for tasks involving fine-grained
manipulation. Addressing this limitation requires not only geometric reasoning
but also an expressive and intuitive way to represent orientation. In this
context, we propose that natural language offers a more flexible representation
space than canonical frames, making it particularly suitable for
instruction-following robotic systems. In this paper, we introduce the concept
of semantic orientation, which defines object orientations using natural
language in a reference-frame-free manner (e.g., the "plug-in" direction of a
USB or the "handle" direction of a knife). To support this, we construct
OrienText300K, a large-scale dataset of 3D models annotated with semantic
orientations that link geometric understanding to functional semantics. By
integrating semantic orientation into a VLM system, we enable robots to
generate manipulation actions with both positional and orientational
constraints. Extensive experiments in simulation and the real world demonstrate
that our approach significantly enhances robotic manipulation capabilities,
e.g., 48.7% accuracy on Open6DOR and 74.9% accuracy on SIMPLER.
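
To make the idea of a semantic orientation more concrete, the following is a minimal illustrative sketch, not the paper's implementation. It assumes a semantic orientation can be stored as a phrase paired with a direction vector expressed in whatever frame the scene is observed in, and that an orientational constraint can be checked by the angle between that direction and a target direction. The names `SemanticOrientation`, `angle_deg`, and `meets_constraint`, the example vectors, and the 10-degree tolerance are all hypothetical.

```python
import math
from dataclasses import dataclass

Vec3 = tuple[float, float, float]


def normalize(v: Vec3) -> Vec3:
    """Scale v to unit length; the zero vector has no direction and is rejected."""
    norm = math.sqrt(sum(c * c for c in v))
    if norm == 0.0:
        raise ValueError("direction must be non-zero")
    return (v[0] / norm, v[1] / norm, v[2] / norm)


@dataclass(frozen=True)
class SemanticOrientation:
    """A direction identified by a phrase (hypothetical container, not from the paper)."""
    phrase: str       # e.g. "plug-in" or "handle"
    direction: Vec3   # assumed: direction in the frame the scene is observed in


def angle_deg(a: Vec3, b: Vec3) -> float:
    """Angle between two directions in degrees, in the range [0, 180]."""
    ua, ub = normalize(a), normalize(b)
    dot = sum(x * y for x, y in zip(ua, ub))
    # Clamp to guard against rounding slightly outside acos's domain.
    return math.degrees(math.acos(max(-1.0, min(1.0, dot))))


def meets_constraint(current: SemanticOrientation, target: Vec3,
                     tol_deg: float = 10.0) -> bool:
    """True if the named direction lies within tol_deg of the target direction."""
    return angle_deg(current.direction, target) <= tol_deg


# Assumed example: the USB's "plug-in" direction currently points along +x,
# while the port requires insertion along -z.
usb = SemanticOrientation("plug-in", (1.0, 0.0, 0.0))
port_axis = (0.0, 0.0, -1.0)
print(angle_deg(usb.direction, port_axis))
print(meets_constraint(usb, port_axis))
```

In this sketch the phrase is only a label; in a system like the one the abstract describes, a model would have to predict the direction from the phrase and the observed object, which is outside the scope of this example.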