UniPose: A Unified Multimodal Framework for Human Pose Comprehension, Generation and Editing

November 25, 2024
Authors: Yiheng Li, Ruibing Hou, Hong Chang, Shiguang Shan, Xilin Chen
cs.AI

Abstract

Human pose plays a crucial role in the digital age. While recent works have achieved impressive progress in understanding and generating human poses, they often support only a single modality of control signals and operate in isolation, limiting their application in real-world scenarios. This paper presents UniPose, a framework employing Large Language Models (LLMs) to comprehend, generate, and edit human poses across various modalities, including images, text, and 3D SMPL poses. Specifically, we apply a pose tokenizer to convert 3D poses into discrete pose tokens, enabling seamless integration into the LLM within a unified vocabulary. To further enhance fine-grained pose perception, we equip UniPose with a mixture of visual encoders, including a pose-specific visual encoder. Benefiting from a unified learning strategy, UniPose effectively transfers knowledge across different pose-relevant tasks, adapts to unseen tasks, and exhibits extended capabilities. This work serves as the first attempt at building a general-purpose framework for pose comprehension, generation, and editing. Extensive experiments highlight UniPose's competitive and even superior performance across various pose-relevant tasks.
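
The idea of placing discrete pose tokens in a unified vocabulary with text can be illustrated with a minimal sketch. The abstract does not describe the tokenizer's internals, so everything below is an assumption made for illustration: the nearest-neighbor codebook lookup, the random codebook, and the names `TEXT_VOCAB_SIZE`, `CODEBOOK_SIZE`, `LATENT_DIM`, `quantize`, `pose_to_token_ids`, and `token_ids_to_latents` are hypothetical and not taken from UniPose.

```python
import random

# All sizes below are hypothetical values chosen for illustration only.
TEXT_VOCAB_SIZE = 32000   # assumed size of the LLM's text vocabulary
CODEBOOK_SIZE = 8         # assumed number of discrete pose codes
LATENT_DIM = 4            # assumed dimensionality of one pose latent

rng = random.Random(0)
# Hypothetical codebook; a real tokenizer would learn these vectors.
codebook = [
    [rng.gauss(0.0, 1.0) for _ in range(LATENT_DIM)]
    for _ in range(CODEBOOK_SIZE)
]


def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def quantize(latent):
    """Return the index of the codebook entry nearest to `latent`."""
    return min(
        range(CODEBOOK_SIZE),
        key=lambda i: squared_distance(latent, codebook[i]),
    )


def pose_to_token_ids(latents):
    """Map pose latents to token ids placed after the text vocabulary."""
    return [TEXT_VOCAB_SIZE + quantize(z) for z in latents]


def token_ids_to_latents(token_ids):
    """Map pose token ids back to their codebook vectors."""
    return [codebook[t - TEXT_VOCAB_SIZE] for t in token_ids]


# Two latents that coincide with codebook entries 3 and 5.
latents = [codebook[3], codebook[5]]
ids = pose_to_token_ids(latents)
recovered = token_ids_to_latents(ids)
```

Offsetting pose code indices by the text vocabulary size is one simple way to keep text and pose ids from colliding, so a single embedding table and output head could cover both kinds of token.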