KMM: Key Frame Mask Mamba for Extended Motion Generation

Abstract

Human motion generation is a cutting-edge area of research in generative computer vision, with promising applications in video creation, game development, and robotic manipulation. The recent Mamba architecture shows promising results in efficiently modeling long and complex sequences, yet two significant challenges remain. First, directly applying Mamba to extended motion generation is ineffective, as the limited capacity of its implicit memory leads to memory decay. Second, Mamba struggles with multimodal fusion compared to Transformers and lacks alignment with textual queries, often confusing directions (left or right) or omitting parts of longer text queries. To address these challenges, our paper presents three key contributions. First, we introduce KMM, a novel architecture featuring Key frame Masking Modeling, designed to enhance Mamba's focus on key actions in motion segments. This approach addresses the memory decay problem and represents a pioneering method in customizing strategic frame-level masking in state space models (SSMs). Second, we design a contrastive learning paradigm to address the multimodal fusion problem in Mamba and improve motion-text alignment. Finally, we conduct extensive experiments on the go-to dataset, BABEL, achieving state-of-the-art performance with a reduction of more than 57% in FID and 70% in parameters compared to previous state-of-the-art methods. See the project website: https://steve-zeyu-zhang.github.io/KMM
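
The abstract does not specify the form of the contrastive objective used for motion-text alignment. As an illustration only, the sketch below shows a generic symmetric InfoNCE-style loss, one common way to align paired embeddings from two modalities. The function names, the temperature value of 0.07, and the assumption that row i of the motion embeddings is paired with row i of the text embeddings are all hypothetical and are not taken from the paper.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length (a zero vector is returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)


def info_nce(anchors, candidates, temperature=0.07):
    """Mean cross-entropy in which anchors[i] should match candidates[i]."""
    total = 0.0
    for i, anchor in enumerate(anchors):
        # Similarity of this anchor to every candidate, scaled by temperature.
        logits = [
            sum(a * c for a, c in zip(anchor, cand)) / temperature
            for cand in candidates
        ]
        # Numerically stable log-sum-exp over the candidates.
        peak = max(logits)
        log_sum_exp = peak + math.log(sum(math.exp(v - peak) for v in logits))
        # Negative log-probability assigned to the matching pair.
        total += log_sum_exp - logits[i]
    return total / len(anchors)


def symmetric_contrastive_loss(motion_emb, text_emb, temperature=0.07):
    """Average of the motion-to-text and text-to-motion InfoNCE terms.

    Hypothetical helper: assumes motion_emb[i] and text_emb[i] form a pair.
    """
    motion = [l2_normalize(v) for v in motion_emb]
    text = [l2_normalize(v) for v in text_emb]
    return 0.5 * (
        info_nce(motion, text, temperature) + info_nce(text, motion, temperature)
    )
```

Under these assumptions, correctly paired embeddings yield a loss near zero, while mismatched pairs (for example, a "left" motion paired with a "right" caption) yield a large loss, which is the kind of pressure a contrastive objective places on direction confusion.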
