RealisDance-DiT: Simple yet Strong Baseline towards Controllable Character Animation in the Wild
April 21, 2025
作者: Jingkai Zhou, Yifan Wu, Shikai Li, Min Wei, Chao Fan, Weihua Chen, Wei Jiang, Fan Wang
cs.AI
Abstract
Controllable character animation remains a challenging problem, particularly
in handling rare poses, stylized characters, character-object interactions,
complex illumination, and dynamic scenes. To tackle these issues, prior work
has largely focused on injecting pose and appearance guidance via elaborate
bypass networks, but often struggles to generalize to open-world scenarios. In
this paper, we propose a new perspective that, as long as the foundation model
is powerful enough, straightforward model modifications with flexible
fine-tuning strategies can largely address the above challenges, taking a step
towards controllable character animation in the wild. Specifically, we
introduce RealisDance-DiT, built upon the Wan-2.1 video foundation model. Our
in-depth analysis reveals that the widely adopted Reference Net design is
suboptimal for large-scale DiT models. Instead, we demonstrate that minimal
modifications to the foundation model architecture yield a surprisingly strong
baseline. We further propose the low-noise warmup and "large batches and small
iterations" strategies to accelerate model convergence during fine-tuning while
maximally preserving the priors of the foundation model. In addition, we
introduce a new test dataset that captures diverse real-world challenges,
complementing existing benchmarks such as the TikTok and UBC fashion video
datasets, to comprehensively evaluate the proposed method. Extensive experiments
show that RealisDance-DiT outperforms existing methods by a large margin.
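The abstract names a "low-noise warmup" strategy for fine-tuning but does not spell out the schedule. The sketch below is one plausible interpretation, not the paper's actual method: during early fine-tuning steps, diffusion timesteps are sampled only from the low-noise end of the range (where the model operates close to clean data and its priors matter most), and the sampling range widens linearly to cover all timesteps as training proceeds. The function name `sample_timestep`, the `warmup_steps` value, and the linear widening are all hypothetical.

```python
import random

def sample_timestep(step: int, warmup_steps: int = 1000, max_t: int = 1000) -> int:
    """Hypothetical low-noise warmup: bias diffusion timestep sampling
    toward low-noise (small t) timesteps early in fine-tuning.

    The upper bound on the sampled timestep grows linearly from 10% of
    max_t at step 0 to the full range once warmup completes.
    """
    frac = min(1.0, step / warmup_steps)          # warmup progress in [0, 1]
    upper = int(max_t * (0.1 + 0.9 * frac))       # low-noise-only at first
    return random.randint(0, upper - 1)           # uniform over allowed range

# Early steps draw only low-noise timesteps; later steps cover the full range.
print(sample_timestep(0))      # always < 100
print(sample_timestep(2000))   # anywhere in [0, 1000)
```

Under this reading, the warmup keeps early gradient updates concentrated on denoising steps where the pretrained foundation model is already accurate, which is consistent with the abstract's stated goal of preserving the base model's priors during fine-tuning.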