KMM: Key Frame Mask Mamba for Extended Motion Generation

Abstract

Human motion generation is a cutting-edge area of research in generative computer vision, with promising applications in video creation, game development, and robotic manipulation. The recent Mamba architecture shows promising results in efficiently modeling long and complex sequences, yet two significant challenges remain. First, directly applying Mamba to extended motion generation is ineffective, as the limited capacity of its implicit memory leads to memory decay. Second, Mamba struggles with multimodal fusion compared to Transformers and lacks alignment with textual queries, often confusing directions (left or right) or omitting parts of longer text queries. To address these challenges, our paper presents three key contributions. First, we introduce KMM, a novel architecture featuring Key frame Masking Modeling, designed to enhance Mamba's focus on key actions in motion segments. This approach addresses the memory decay problem and represents a pioneering method for customizing strategic frame-level masking in SSMs. Second, we design a contrastive learning paradigm to address the multimodal fusion problem in Mamba and improve motion-text alignment. Finally, we conduct extensive experiments on the go-to dataset, BABEL, achieving state-of-the-art performance with a reduction of more than 57% in FID and 70% in parameters compared to previous state-of-the-art methods. See the project website: https://steve-zeyu-zhang.github.io/KMM
