SOLAMI: Social Vision-Language-Action Modeling for Immersive Interaction with 3D Autonomous Characters
November 29, 2024
作者: Jianping Jiang, Weiye Xiao, Zhengyu Lin, Huaizhong Zhang, Tianxiang Ren, Yang Gao, Zhiqian Lin, Zhongang Cai, Lei Yang, Ziwei Liu
cs.AI
Abstract
Human beings are social animals. How to equip 3D autonomous characters with similar social intelligence that can perceive, understand, and interact with humans remains an open yet fundamental problem. In this paper, we introduce SOLAMI, the first end-to-end Social vision-Language-Action (VLA) Modeling framework for Immersive interaction with 3D autonomous characters. Specifically, SOLAMI builds 3D autonomous characters from three aspects: (1) Social VLA Architecture: We propose a unified social VLA framework that generates multimodal responses (speech and motion) based on the user's multimodal input to drive the character for social interaction. (2) Interactive Multimodal Data: We present SynMSI, a synthetic multimodal social interaction dataset generated by an automatic pipeline that uses only existing motion datasets, addressing the issue of data scarcity. (3) Immersive VR Interface: We develop a VR interface that enables users to immersively interact with these characters driven by various architectures. Extensive quantitative experiments and user studies demonstrate that our framework produces more precise and natural character responses (in both speech and motion) that align with user expectations, at lower latency.
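To make the end-to-end idea concrete, here is a minimal Python sketch of a unified social VLA loop: a single autoregressive policy consumes the user's speech and motion tokens together and emits both response modalities in one pass. All names (`SocialVLAAgent`, `MultimodalInput`) and the interleaving scheme are illustrative assumptions, not SOLAMI's actual interface.

```python
from dataclasses import dataclass
from typing import Callable, List

@dataclass
class MultimodalInput:
    speech: List[int]  # discretized user speech tokens (assumed tokenizer)
    motion: List[int]  # discretized user motion tokens (assumed tokenizer)

@dataclass
class MultimodalResponse:
    speech: List[int]  # character speech tokens, decoded to audio downstream
    motion: List[int]  # character motion tokens, driving the 3D avatar

class SocialVLAAgent:
    """One autoregressive policy maps interleaved speech/motion tokens to
    both output modalities, rather than chaining separate ASR -> LLM ->
    TTS / motion-generation stages."""

    def __init__(self, model: Callable[[List[int]], List[int]]):
        self.model = model            # token sequence -> token sequence
        self.history: List[int] = []  # dialogue context kept across turns

    def step(self, obs: MultimodalInput) -> MultimodalResponse:
        # The sequence layout here is illustrative; SOLAMI's actual
        # tokenization and ordering may differ.
        prompt = self.history + obs.speech + obs.motion
        out = self.model(prompt)
        self.history = prompt + out
        # Illustrative de-interleaving: even indices -> speech, odd -> motion.
        return MultimodalResponse(speech=out[0::2], motion=out[1::2])

# Usage with a stub model (echoes the last four context tokens):
agent = SocialVLAAgent(lambda toks: toks[-4:])
reply = agent.step(MultimodalInput(speech=[1, 2], motion=[3, 4]))
print(reply)
```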
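The abstract only names the SynMSI pipeline, so the following is a hedged sketch of what "an automatic pipeline using only existing motion datasets" could look like: pair captioned motion clips with synthesized dialogue turns. The data classes and the templating stub are assumptions for illustration, not the paper's method; a real pipeline would prompt an LLM and retrieve the best-matching clip rather than fill a template.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class MotionClip:
    clip_id: str
    caption: str  # text description shipped with the source motion dataset

@dataclass
class DialogueTurn:
    speaker: str    # "user" or "character"
    text: str       # utterance (synthesized to speech downstream)
    motion_id: str  # motion clip attached to this turn, if any

def synthesize_turns(clip: MotionClip) -> List[DialogueTurn]:
    # Stub for an LLM-driven script writer: ground a short social
    # exchange in the clip's caption, then reuse the source clip as
    # the character's motion for that turn.
    user = DialogueTurn("user", f"Can you show me how you {clip.caption}?", "")
    char = DialogueTurn("character", f"Sure, watch me {clip.caption}!", clip.clip_id)
    return [user, char]

# Usage on a toy clip entry:
sample = MotionClip("clip_0001", "wave with both hands")
for turn in synthesize_turns(sample):
    print(turn)
```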