
RealisDance-DiT: Simple yet Strong Baseline towards Controllable Character Animation in the Wild

April 21, 2025
Authors: Jingkai Zhou, Yifan Wu, Shikai Li, Min Wei, Chao Fan, Weihua Chen, Wei Jiang, Fan Wang
cs.AI

Abstract

Controllable character animation remains a challenging problem, particularly in handling rare poses, stylized characters, character-object interactions, complex illumination, and dynamic scenes. To tackle these issues, prior work has largely focused on injecting pose and appearance guidance via elaborate bypass networks, but it often struggles to generalize to open-world scenarios. In this paper, we propose a new perspective: as long as the foundation model is powerful enough, straightforward model modifications combined with flexible fine-tuning strategies can largely address the above challenges, taking a step towards controllable character animation in the wild. Specifically, we introduce RealisDance-DiT, built upon the Wan-2.1 video foundation model. Our in-depth analysis reveals that the widely adopted Reference Net design is suboptimal for large-scale DiT models. Instead, we demonstrate that minimal modifications to the foundation model architecture yield a surprisingly strong baseline. We further propose the low-noise warmup and "large batches and small iterations" strategies to accelerate model convergence during fine-tuning while maximally preserving the priors of the foundation model. In addition, we introduce a new test dataset that captures diverse real-world challenges, complementing existing benchmarks such as the TikTok dataset and the UBC fashion video dataset, to comprehensively evaluate the proposed method. Extensive experiments show that RealisDance-DiT outperforms existing methods by a large margin.
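
The abstract names a low-noise warmup strategy for fine-tuning but does not describe its mechanics here. The sketch below is a minimal illustration only, assuming the warmup restricts the sampled diffusion timesteps to low noise levels early in fine-tuning and gradually widens the range; the function name, schedule, and parameters are hypothetical and are not the paper's implementation.

# Illustrative sketch only: "low-noise warmup" is assumed to mean biasing
# sampled diffusion timesteps toward low noise levels at the start of
# fine-tuning, then gradually restoring the full timestep range.
# All names and constants below are hypothetical.
import torch

def sample_timesteps(batch_size: int, step: int, warmup_steps: int = 1000,
                     num_train_timesteps: int = 1000) -> torch.Tensor:
    """Sample training timesteps, restricted to low noise during warmup."""
    if step < warmup_steps:
        # Linearly widen the allowed timestep range from 10% to 100%.
        frac = 0.1 + 0.9 * (step / warmup_steps)
    else:
        frac = 1.0
    max_t = max(1, int(frac * num_train_timesteps))
    # Low timestep indices correspond to low noise levels (t = 0 is clean).
    return torch.randint(0, max_t, (batch_size,))

# Example: early in fine-tuning, only low-noise timesteps are drawn.
print(sample_timesteps(batch_size=4, step=0))      # values in [0, 100)
print(sample_timesteps(batch_size=4, step=2000))   # values in [0, 1000)

Under this reading, early updates stay close to the foundation model's low-noise behavior, which is consistent with the stated goal of accelerating convergence while preserving the model's priors.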
