Mogo: RQ Hierarchical Causal Transformer for High-Quality 3D Human Motion Generation
December 5, 2024
Author: Dongjie Fu
cs.AI
Abstract
In the field of text-to-motion generation, BERT-type masked models (MoMask,
MMM) currently produce higher-quality outputs than GPT-type
autoregressive models (T2M-GPT). However, these BERT-type models often lack the
streaming output capability required for applications in video game and
multimedia environments, a feature inherent to GPT-type models. Additionally,
they demonstrate weaker performance in out-of-distribution generation. To
surpass the quality of BERT-type models while leveraging a GPT-type structure,
without adding extra refinement models that complicate data scaling, we propose
a novel architecture, Mogo (Motion Only Generate Once), which generates
high-quality lifelike 3D human motions by training a single transformer model.
Mogo consists of only two main components: 1) RVQ-VAE, a hierarchical residual
vector quantization variational autoencoder, which discretizes continuous
motion sequences with high precision; 2) Hierarchical Causal Transformer,
responsible for generating the base motion sequences in an autoregressive
manner while simultaneously inferring residuals across different layers.
Experimental results demonstrate that Mogo can generate continuous and cyclic
motion sequences of up to 260 frames (13 seconds), surpassing the 196-frame
(10-second) length limit of existing datasets such as HumanML3D. On the
HumanML3D test set, Mogo achieves an FID score of 0.079, outperforming the
GPT-type models T2M-GPT (FID = 0.116) and AttT2M (FID = 0.112) as well as the
BERT-type model MMM (FID = 0.080). Furthermore, our model achieves the best
quantitative performance in out-of-distribution generation.
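
To make the discretization idea concrete, the following is a minimal sketch of residual vector quantization (RVQ), the mechanism an RVQ-VAE uses to turn continuous motion latents into hierarchical code indices. The number of layers, codebook size, feature dimension, and nearest-neighbour lookup below are illustrative assumptions, not the paper's actual configuration or learned codebooks.

```python
import numpy as np

rng = np.random.default_rng(0)

# Assumed toy hyperparameters: 4 residual layers, 512 codes per layer, 64-dim latents.
num_layers, codebook_size, dim = 4, 512, 64
codebooks = rng.normal(size=(num_layers, codebook_size, dim))

def rvq_encode(x, codebooks):
    """Quantize one latent vector into a code index per residual layer."""
    residual = x.copy()
    indices = []
    for cb in codebooks:
        idx = int(np.argmin(np.linalg.norm(cb - residual, axis=1)))  # nearest code
        indices.append(idx)
        residual = residual - cb[idx]  # pass the quantization error to the next layer
    return indices

def rvq_decode(indices, codebooks):
    """Reconstruct the latent by summing the selected codes across layers."""
    return sum(cb[i] for cb, i in zip(codebooks, indices))

x = rng.normal(size=dim)              # stand-in for one frame's continuous motion latent
codes = rvq_encode(x, codebooks)
x_hat = rvq_decode(codes, codebooks)
print(codes, float(np.linalg.norm(x - x_hat)))  # error shrinks as more residual layers are used
```

In Mogo's setup, the base layer's code sequence is what the causal transformer generates autoregressively, with the higher residual layers refining it; this toy example only illustrates the quantize/reconstruct step, not the learned codebooks or the transformer itself.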