SOLAMI: Social Vision-Language-Action Modeling for Immersive Interaction with 3D Autonomous Characters

November 29, 2024
Authors: Jianping Jiang, Weiye Xiao, Zhengyu Lin, Huaizhong Zhang, Tianxiang Ren, Yang Gao, Zhiqian Lin, Zhongang Cai, Lei Yang, Ziwei Liu
cs.AI

Abstract

Human beings are social animals. How to equip 3D autonomous characters with similar social intelligence, so that they can perceive, understand, and interact with humans, remains an open yet fundamental problem. In this paper, we introduce SOLAMI, the first end-to-end Social Vision-Language-Action (VLA) Modeling framework for immersive interaction with 3D autonomous characters. Specifically, SOLAMI builds 3D autonomous characters from three aspects: (1) Social VLA Architecture: We propose a unified social VLA framework that generates multimodal responses (speech and motion) from the user's multimodal input to drive the character in social interaction. (2) Interactive Multimodal Data: We present SynMSI, a synthetic multimodal social interaction dataset generated by an automatic pipeline that uses only existing motion datasets, addressing the issue of data scarcity. (3) Immersive VR Interface: We develop a VR interface that enables users to interact immersively with characters driven by various architectures. Extensive quantitative experiments and user studies demonstrate that our framework produces more precise and natural character responses (in both speech and motion) that align with user expectations, at lower latency.
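The Social VLA architecture described in point (1) can be pictured as a single autoregressive model over a unified token vocabulary in which the user's speech and body motion are discretized, and the character's speech and motion response is generated token by token. The sketch below is a minimal, hypothetical illustration of that idea, not the paper's implementation: the vocabulary layout, the `SocialVLASketch` module, and the `respond` helper are invented for demonstration, and a real system would plug in pretrained speech/motion tokenizers and an LLM backbone.

```python
import torch
import torch.nn as nn

# Hypothetical unified vocabulary: text tokens, then discrete speech and
# motion codebook entries appended as extra token ranges (assumed layout).
TEXT_VOCAB, SPEECH_VOCAB, MOTION_VOCAB = 32000, 1024, 512
VOCAB = TEXT_VOCAB + SPEECH_VOCAB + MOTION_VOCAB


class SocialVLASketch(nn.Module):
    """Toy transformer backbone over the shared multimodal vocabulary."""

    def __init__(self, dim=256, heads=4, layers=2):
        super().__init__()
        self.embed = nn.Embedding(VOCAB, dim)
        layer = nn.TransformerEncoderLayer(dim, heads, batch_first=True)
        self.backbone = nn.TransformerEncoder(layer, layers)
        self.head = nn.Linear(dim, VOCAB)

    def forward(self, tokens):
        # Causal masking is omitted here for brevity; a real decoder needs it.
        h = self.backbone(self.embed(tokens))
        return self.head(h)


@torch.no_grad()
def respond(model, user_speech_tokens, user_motion_tokens, max_new=8):
    """Greedily emit response tokens conditioned on the user's speech+motion."""
    ctx = torch.cat([user_speech_tokens, user_motion_tokens], dim=-1)
    for _ in range(max_new):
        logits = model(ctx)[:, -1]            # next-token distribution
        nxt = logits.argmax(-1, keepdim=True) # greedy pick (sampling in practice)
        ctx = torch.cat([ctx, nxt], dim=-1)   # feed back autoregressively
    return ctx


model = SocialVLASketch()
# Dummy user input: 6 speech tokens and 6 motion tokens from their ranges.
speech = torch.randint(TEXT_VOCAB, TEXT_VOCAB + SPEECH_VOCAB, (1, 6))
motion = torch.randint(TEXT_VOCAB + SPEECH_VOCAB, VOCAB, (1, 6))
print(respond(model, speech, motion).shape)  # context grown by 8 new tokens
```

The design point this illustrates is the end-to-end claim: because speech and motion share one vocabulary, a single generative pass yields both response modalities, avoiding a cascaded recognize-then-reason-then-synthesize pipeline, which is consistent with the lower latency the abstract reports.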
