

DexGraspVLA: A Vision-Language-Action Framework Towards General Dexterous Grasping

February 28, 2025
作者: Yifan Zhong, Xuchuan Huang, Ruochong Li, Ceyao Zhang, Yitao Liang, Yaodong Yang, Yuanpei Chen
cs.AI

Abstract

Dexterous grasping remains a fundamental yet challenging problem in robotics. A general-purpose robot must be capable of grasping diverse objects in arbitrary scenarios. However, existing research typically relies on specific assumptions, such as single-object settings or limited environments, leading to constrained generalization. Our solution is DexGraspVLA, a hierarchical framework that utilizes a pre-trained Vision-Language model as the high-level task planner and learns a diffusion-based policy as the low-level Action controller. The key insight lies in iteratively transforming diverse language and visual inputs into domain-invariant representations, where imitation learning can be effectively applied due to the alleviation of domain shift. Thus, it enables robust generalization across a wide range of real-world scenarios. Notably, our method achieves a 90+% success rate under thousands of unseen object, lighting, and background combinations in a "zero-shot" environment. Empirical analysis further confirms the consistency of internal model behavior across environmental variations, thereby validating our design and explaining its generalization performance. We hope our work can be a step forward in achieving general dexterous grasping. Our demo and code can be found at https://dexgraspvla.github.io/.
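
The abstract describes a hierarchical design: a pre-trained vision-language model decides what to grasp at the task level, and a diffusion-based policy, trained by imitation learning on domain-invariant representations, produces the low-level actions. The sketch below illustrates that planner/controller loop in Python as a reading aid; every name in it (VLMPlanner, DiffusionController, extract_invariant_features, the 22-dimensional action, the env interface) is a hypothetical stand-in for the described structure, not the authors' released code.

```python
# Hypothetical sketch of a hierarchical VLA grasping loop, as described in the
# DexGraspVLA abstract. All names and dimensions are illustrative placeholders.
from dataclasses import dataclass
import numpy as np


@dataclass
class Observation:
    rgb: np.ndarray       # camera image of the scene
    instruction: str      # free-form language command, e.g. "grasp the red mug"


class VLMPlanner:
    """High-level planner: a pre-trained vision-language model that turns the
    instruction and scene image into a grasp target specification."""

    def plan(self, obs: Observation) -> dict:
        # A real system would query the VLM here; this stub returns a
        # placeholder target description (e.g., a bounding box in the image).
        return {"target": "object_of_interest", "region": (0, 0, 64, 64)}


class DiffusionController:
    """Low-level controller: a diffusion-based policy trained by imitation
    learning on domain-invariant features."""

    def act(self, features: np.ndarray) -> np.ndarray:
        # A real diffusion policy would iteratively denoise an action sequence
        # conditioned on the features; this stub returns a zero action.
        return np.zeros(22)  # e.g., arm + dexterous-hand joint targets


def extract_invariant_features(obs: Observation, plan: dict) -> np.ndarray:
    """Map raw pixels and the planner's target into a domain-invariant
    representation (the abstract's key idea); stubbed out here."""
    return np.zeros(512)


def grasp_episode(env, planner: VLMPlanner, controller: DiffusionController):
    """One episode: plan once at the task level, then run the closed loop."""
    obs = env.reset()
    plan = planner.plan(obs)                        # high level: what to grasp
    for _ in range(200):                            # low level: how to grasp
        features = extract_invariant_features(obs, plan)
        action = controller.act(features)
        obs, done = env.step(action)
        if done:
            break
```

Under this reading, generalization comes from the middle layer: the controller never sees raw, domain-specific pixels directly, only representations intended to look similar across objects, lighting, and backgrounds, which is where the abstract argues imitation learning remains reliable.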
