Emma-X: An Embodied Multimodal Action Model with Grounded Chain of Thought and Look-ahead Spatial Reasoning

Abstract

Traditional reinforcement learning-based robotic control methods are often task-specific and fail to generalize across diverse environments or to unseen objects and instructions. Visual Language Models (VLMs) demonstrate strong scene understanding and planning capabilities but lack the ability to generate actionable policies tailored to specific robotic embodiments. To address this, Visual-Language-Action (VLA) models have emerged, yet they face challenges in long-horizon spatial reasoning and grounded task planning. In this work, we propose Emma-X, an Embodied Multimodal Action Model with Grounded Chain of Thought and Look-ahead Spatial Reasoning. Emma-X leverages a hierarchical embodiment dataset that we construct from BridgeV2, containing 60,000 robot manipulation trajectories auto-annotated with grounded task reasoning and spatial guidance. Additionally, we introduce a trajectory segmentation strategy based on gripper states and motion trajectories, which can help mitigate hallucination when generating grounded subtask reasoning. Experimental results demonstrate that Emma-X achieves superior performance over competitive baselines, particularly in real-world robotic tasks requiring spatial reasoning.
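The abstract does not spell out how trajectories are segmented, so the following is only a minimal sketch of one plausible reading: a trajectory is cut wherever the gripper toggles between open and closed, or wherever the end-effector's direction of motion changes sharply. The function name `segment_trajectory`, the helper `_angle_between`, the input layout, and the default `angle_threshold` are all assumptions made for illustration and are not taken from the paper.

```python
import math
from typing import List, Sequence, Tuple

Vec3 = Tuple[float, float, float]


def _angle_between(a: Vec3, b: Vec3) -> float:
    """Angle in radians between two displacement vectors (0.0 if either is near zero)."""
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    if norm_a < 1e-9 or norm_b < 1e-9:
        return 0.0
    cos = sum(x * y for x, y in zip(a, b)) / (norm_a * norm_b)
    return math.acos(max(-1.0, min(1.0, cos)))


def segment_trajectory(
    positions: Sequence[Vec3],
    gripper_open: Sequence[bool],
    angle_threshold: float = math.pi / 4,  # hypothetical value, not from the paper
) -> List[Tuple[int, int]]:
    """Split a trajectory into (start, end) index ranges, end exclusive.

    A new segment starts at step t when the gripper state differs from step
    t - 1, or when the motion direction turns by more than angle_threshold.
    """
    if len(positions) != len(gripper_open):
        raise ValueError("positions and gripper_open must have the same length")
    n = len(positions)
    if n == 0:
        return []

    boundaries = [0]
    for t in range(1, n):
        cut = gripper_open[t] != gripper_open[t - 1]
        if not cut and t < n - 1:
            # Compare the step arriving at t with the step leaving t.
            prev_step = tuple(p - q for p, q in zip(positions[t], positions[t - 1]))
            next_step = tuple(p - q for p, q in zip(positions[t + 1], positions[t]))
            cut = _angle_between(prev_step, next_step) > angle_threshold
        if cut:
            boundaries.append(t)
    boundaries.append(n)
    return list(zip(boundaries[:-1], boundaries[1:]))


# Example: reach along x with the gripper open, close it, then lift along z.
reach_and_lift = segment_trajectory(
    positions=[(0, 0, 0), (1, 0, 0), (2, 0, 0), (3, 0, 0), (3, 0, 1), (3, 0, 2)],
    gripper_open=[True, True, True, False, False, False],
)
```

Each resulting segment could then be annotated separately with subtask reasoning, which is the setting in which the abstract suggests segmentation helps reduce hallucination. Real data would likely need smoothing or a minimum segment length, which this sketch omits.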
