UniPose: A Unified Multimodal Framework for Human Pose Comprehension, Generation and Editing

Abstract

Human pose plays a crucial role in the digital age. While recent work has made impressive progress in understanding and generating human poses, existing methods typically support only a single modality of control signal and operate in isolation, which limits their use in real-world scenarios. This paper presents UniPose, a framework that employs Large Language Models (LLMs) to comprehend, generate, and edit human poses across multiple modalities, including images, text, and 3D SMPL poses. Specifically, we apply a pose tokenizer to convert 3D poses into discrete pose tokens, enabling seamless integration into the LLM within a unified vocabulary. To further enhance fine-grained pose perception, we equip UniPose with a mixture of visual encoders, including a pose-specific visual encoder. Benefiting from a unified learning strategy, UniPose effectively transfers knowledge across different pose-relevant tasks, adapts to unseen tasks, and exhibits extended capabilities. This work is the first attempt at building a general-purpose framework for pose comprehension, generation, and editing. Extensive experiments highlight UniPose's competitive and even superior performance across various pose-relevant tasks.
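The abstract does not describe how the pose tokenizer is built, so the following is only a minimal sketch of one common way to obtain discrete tokens that share a vocabulary with text: nearest-neighbor lookup in a codebook, with the resulting indices shifted past the text vocabulary. The codebook, the sizes, and every function name here are hypothetical and are not taken from UniPose.

```python
import math
import random

# All constants below are illustrative assumptions, not values from the paper.
TEXT_VOCAB_SIZE = 32000  # hypothetical size of the LLM's text vocabulary
CODEBOOK_SIZE = 8        # hypothetical number of pose codes
FEATURE_DIM = 4          # hypothetical dimension of an encoded pose feature


def make_codebook(size, dim, seed=0):
    """Build a random codebook; a real one would be learned from pose data."""
    rng = random.Random(seed)
    return [[rng.uniform(-1.0, 1.0) for _ in range(dim)] for _ in range(size)]


def quantize(feature, codebook):
    """Return the index of the codebook entry nearest to `feature` (Euclidean)."""
    return min(range(len(codebook)), key=lambda i: math.dist(feature, codebook[i]))


def pose_to_token_ids(features, codebook, offset=TEXT_VOCAB_SIZE):
    """Map pose features to token ids placed after the text vocabulary."""
    return [offset + quantize(f, codebook) for f in features]


def token_ids_to_features(token_ids, codebook, offset=TEXT_VOCAB_SIZE):
    """Invert the mapping: recover the codebook entry for each pose token id."""
    return [codebook[t - offset] for t in token_ids]


codebook = make_codebook(CODEBOOK_SIZE, FEATURE_DIM)
# Stand-in for the output of a pose encoder: two features that sit on codes 3 and 5.
pose_features = [codebook[3], codebook[5]]
pose_tokens = pose_to_token_ids(pose_features, codebook)
```

Because the pose token ids start at `TEXT_VOCAB_SIZE`, they cannot collide with text token ids, which is the sense in which a single vocabulary can hold both. A learned tokenizer would replace the random codebook and the stand-in features with trained components.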
