MoCha: Towards Movie-Grade Talking Character Synthesis

March 30, 2025
Authors: Cong Wei, Bo Sun, Haoyu Ma, Ji Hou, Felix Juefei-Xu, Zecheng He, Xiaoliang Dai, Luxin Zhang, Kunpeng Li, Tingbo Hou, Animesh Sinha, Peter Vajda, Wenhu Chen
cs.AI

Abstract

Recent advancements in video generation have achieved impressive motion realism, yet they often overlook character-driven storytelling, which is crucial for automated film and animation generation. We introduce Talking Characters, a more realistic task: generating talking character animations directly from speech and text. Unlike talking-head generation, Talking Characters aims to generate the full portrait of one or more characters, beyond the facial region. In this paper, we propose MoCha, the first model of its kind to generate talking characters. To ensure precise synchronization between video and speech, we propose a speech-video window attention mechanism that effectively aligns speech and video tokens. To address the scarcity of large-scale speech-labeled video datasets, we introduce a joint training strategy that leverages both speech-labeled and text-labeled video data, significantly improving generalization across diverse character actions. We also design structured prompt templates with character tags, enabling, for the first time, multi-character conversation with turn-based dialogue, so that AI-generated characters can engage in context-aware conversations with cinematic coherence. Extensive qualitative and quantitative evaluations, including human preference studies and benchmark comparisons, demonstrate that MoCha sets a new standard for AI-generated cinematic storytelling, achieving superior realism, expressiveness, controllability, and generalization.

April 1, 2025