MagicInfinite: Generating Infinite Talking Videos with Your Words and Voice
March 7, 2025
Authors: Hongwei Yi, Tian Ye, Shitong Shao, Xuancheng Yang, Jiantong Zhao, Hanzhong Guo, Terrance Wang, Qingyu Yin, Zeke Xie, Lei Zhu, Wei Li, Michael Lingelbach, Daquan Zhou
cs.AI
Abstract
We present MagicInfinite, a novel diffusion Transformer (DiT) framework that
overcomes traditional portrait animation limitations, delivering high-fidelity
results across diverse character types: realistic humans, full-body figures, and
stylized anime characters. It supports varied facial poses, including
back-facing views, and animates single or multiple characters with input masks
for precise speaker designation in multi-character scenes. Our approach tackles
key challenges with three innovations: (1) 3D full-attention mechanisms with a
sliding window denoising strategy, enabling infinite video generation with
temporal coherence and visual quality across diverse character styles; (2) a
two-stage curriculum learning scheme, integrating audio for lip sync, text for
expressive dynamics, and reference images for identity preservation, enabling
flexible multi-modal control over long sequences; and (3) region-specific masks
with adaptive loss functions to balance global textual control and local audio
guidance, supporting speaker-specific animations. Efficiency is enhanced via
our innovative unified step and CFG distillation techniques, achieving a 20x
inference speedup over the base model: generating a 10-second 540x540p video
in 10 seconds or a 720x720p video in 30 seconds on 8 H100 GPUs, without quality loss.
Evaluations on our new benchmark demonstrate MagicInfinite's superiority in
audio-lip synchronization, identity preservation, and motion naturalness across
diverse scenarios. It is publicly available at https://www.hedra.com/, with
examples at https://magicinfinite.github.io/.
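
The abstract names a sliding window denoising strategy but does not spell out its mechanics. As a rough illustration only, the sketch below shows one common way such a scheme can be arranged: a long sequence of per-frame latents is split into overlapping fixed-size windows, each window is denoised on its own, and overlapping outputs are averaged. The function names (`window_starts`, `sliding_window_denoise`), the scalar per-frame latents, the single denoising pass, and the plain averaging are all assumptions made for this example, not details taken from MagicInfinite.

```python
from typing import Callable, List


def window_starts(num_frames: int, window: int, stride: int) -> List[int]:
    """Start indices of overlapping windows that together cover every frame."""
    if window <= 0 or stride <= 0 or stride > window:
        raise ValueError("need 0 < stride <= window")
    if num_frames <= window:
        return [0]
    starts = list(range(0, num_frames - window, stride))
    starts.append(num_frames - window)  # last window sits flush with the end
    return starts


def sliding_window_denoise(
    latents: List[float],
    denoise_window: Callable[[List[float]], List[float]],
    window: int,
    stride: int,
) -> List[float]:
    """Apply denoise_window to overlapping chunks and average the overlaps.

    Hypothetical helper: each frame latent is a single float here, and
    denoise_window stands in for one denoising step of a real model.
    """
    n = len(latents)
    totals = [0.0] * n  # sum of outputs written to each frame
    counts = [0] * n    # number of windows that covered each frame
    for start in window_starts(n, window, stride):
        chunk = latents[start:start + window]
        for offset, value in enumerate(denoise_window(chunk)):
            totals[start + offset] += value
            counts[start + offset] += 1
    return [total / count for total, count in zip(totals, counts)]


if __name__ == "__main__":
    # Toy "denoiser" that halves every value in its window.
    frames = [float(i) for i in range(10)]
    print(sliding_window_denoise(frames, lambda c: [x * 0.5 for x in c], 4, 2))
```

Because the window size is fixed, the cost of each model call does not grow with the total sequence length, which is what makes arbitrarily long outputs feasible in schemes of this kind; the overlap is what gives neighbouring windows a chance to agree at their boundaries.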