Autonomous Character-Scene Interaction Synthesis from Text Instruction

Abstract

Synthesizing human motions in 3D environments, particularly complex activities such as locomotion, hand-reaching, and human-object interaction, typically requires user-defined waypoints and stage transitions. These requirements pose challenges for current models and leave a notable gap in automating character animation from simple human inputs. This paper addresses the gap by introducing a comprehensive framework that synthesizes multi-stage, scene-aware interaction motions directly from a single text instruction and a goal location. Our approach employs an auto-regressive diffusion model to synthesize the next motion segment, together with an autonomous scheduler that predicts the transition for each action stage. To ensure that the synthesized motions integrate seamlessly with the environment, we propose a scene representation that considers local perception at both the start and the goal location. We further enhance the coherence of the generated motion by integrating frame embeddings with the language input. Additionally, to support model training, we present a comprehensive motion-capture dataset comprising 16 hours of motion sequences in 120 indoor scenes, covering 40 types of motions, each annotated with precise language descriptions. Experimental results demonstrate the efficacy of our method in generating high-quality, multi-stage motions closely aligned with environmental and textual conditions.
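The control flow described above, in which a segment generator is called repeatedly while a scheduler decides when to move to the next action stage, can be illustrated with a toy sketch. Everything in it is an assumption made for illustration: the stage names, the function names, and the numeric thresholds are hypothetical, the "sampler" is a noisy step toward the goal rather than a diffusion model, and the scheduler is a hand-written rule rather than the learned predictor the paper describes. Scene and text conditioning are omitted entirely.

```python
import math
import random

# Illustrative stage names only; the paper's actual stage set is not given here.
STAGES = ("locomotion", "reaching", "interaction")


def sample_next_segment(history, goal, stage, rng, seg_len=4):
    """Stand-in for the diffusion sampler: returns seg_len new 2D root positions.

    A real model would denoise a full-body motion segment conditioned on the
    motion history, the scene, and the text. Here we only step toward the goal.
    """
    x, y = history[-1]
    segment = []
    for _ in range(seg_len):
        dx, dy = goal[0] - x, goal[1] - y
        dist = math.hypot(dx, dy)
        # Only the locomotion stage moves the root; other stages stay in place.
        stride = min(0.2, dist) if stage == "locomotion" else 0.0
        if dist > 1e-9:
            x += stride * dx / dist
            y += stride * dy / dist
        # Small jitter so the output behaves like a sample, not a fixed path.
        x += rng.gauss(0.0, 0.002)
        y += rng.gauss(0.0, 0.002)
        segment.append((x, y))
    return segment


def schedule_transition(stage_idx, last_pos, goal, segments_in_stage):
    """Hypothetical rule-based stand-in for the learned stage scheduler."""
    if STAGES[stage_idx] == "locomotion":
        done = math.dist(last_pos, goal) < 0.1
    else:
        done = segments_in_stage >= 2
    return stage_idx + 1 if done else stage_idx


def rollout(start, goal, seed=0, max_segments=200):
    """Autoregressive loop: generate a segment, append it, ask the scheduler."""
    rng = random.Random(seed)
    history = [start]
    stage_idx, segments_in_stage = 0, 0
    stage_log = []
    for _ in range(max_segments):
        if stage_idx >= len(STAGES):
            break  # all stages completed
        stage = STAGES[stage_idx]
        history.extend(sample_next_segment(history, goal, stage, rng))
        stage_log.append(stage)
        segments_in_stage += 1
        new_idx = schedule_transition(
            stage_idx, history[-1], goal, segments_in_stage
        )
        if new_idx != stage_idx:
            stage_idx, segments_in_stage = new_idx, 0
    return history, stage_log
```

Calling `rollout((0.0, 0.0), (2.0, 1.0))` returns the accumulated root trajectory and a per-segment log of which stage produced each segment. The point of the sketch is the separation of concerns: the generator only ever produces the next segment from what came before, and the stage transition is decided by a separate component instead of being specified by the user.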
