

Mogo: RQ Hierarchical Causal Transformer for High-Quality 3D Human Motion Generation

December 5, 2024
Author: Dongjie Fu
cs.AI

Abstract

In the field of text-to-motion generation, BERT-type masked models (MoMask, MMM) currently produce higher-quality outputs than GPT-type autoregressive models (T2M-GPT). However, these BERT-type models often lack the streaming output capability required for video game and multimedia applications, a feature inherent to GPT-type models, and they perform worse on out-of-distribution generation. To surpass the quality of BERT-type models while retaining a GPT-type structure, without adding extra refinement models that complicate data scaling, we propose a novel architecture, Mogo (Motion Only Generate Once), which generates high-quality, lifelike 3D human motions by training a single transformer model. Mogo consists of only two main components: 1) an RVQ-VAE, a hierarchical residual vector quantization variational autoencoder, which discretizes continuous motion sequences with high precision; and 2) a Hierarchical Causal Transformer, responsible for generating the base motion sequence autoregressively while simultaneously inferring residuals across layers. Experimental results demonstrate that Mogo can generate continuous and cyclic motion sequences of up to 260 frames (13 seconds), surpassing the 196-frame (10-second) length limit of existing datasets such as HumanML3D. On the HumanML3D test set, Mogo achieves an FID score of 0.079, outperforming the GPT-type models T2M-GPT (FID = 0.116) and AttT2M (FID = 0.112) as well as the BERT-type model MMM (FID = 0.080). Furthermore, our model achieves the best quantitative performance in out-of-distribution generation.
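The residual quantization scheme underlying the RVQ-VAE can be sketched in a few lines: each codebook layer quantizes the residual left over by the previous layer, so the reconstruction is the sum of the selected codes across layers. This is a minimal illustrative sketch of generic residual vector quantization, not the paper's implementation; the codebook sizes, dimensions, and random data below are assumptions for demonstration only.

```python
import numpy as np

def residual_vq(x, codebooks):
    """Residual vector quantization sketch.

    Each layer picks the nearest code to the current residual,
    then passes the remaining residual to the next layer. The
    reconstruction is the sum of the selected codes.
    """
    residual = x.astype(float)
    recon = np.zeros_like(residual)
    indices = []
    for cb in codebooks:  # cb has shape (num_codes, dim)
        dists = np.linalg.norm(cb - residual, axis=1)
        idx = int(np.argmin(dists))
        indices.append(idx)
        recon += cb[idx]
        residual = residual - cb[idx]
    return indices, recon

# Illustrative sizes (hypothetical, not from the paper):
rng = np.random.default_rng(0)
dim, layers, num_codes = 4, 3, 8
codebooks = [rng.normal(size=(num_codes, dim)) for _ in range(layers)]
x = rng.normal(size=dim)

idx, recon = residual_vq(x, codebooks)
err = np.linalg.norm(x - recon)
```

In the paper's setting, the sequence of per-layer indices would then serve as the discrete tokens that the hierarchical causal transformer predicts, with the base layer generated autoregressively and the residual layers inferred on top of it.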
