RealisDance-DiT: Simple yet Strong Baseline towards Controllable Character Animation in the Wild

April 21, 2025
Authors: Jingkai Zhou, Yifan Wu, Shikai Li, Min Wei, Chao Fan, Weihua Chen, Wei Jiang, Fan Wang
cs.AI

Abstract

Controllable character animation remains a challenging problem, particularly in handling rare poses, stylized characters, character-object interactions, complex illumination, and dynamic scenes. To tackle these issues, prior work has largely focused on injecting pose and appearance guidance via elaborate bypass networks, but it often struggles to generalize to open-world scenarios. In this paper, we propose a new perspective: as long as the foundation model is powerful enough, straightforward model modifications combined with flexible fine-tuning strategies can largely address the above challenges, taking a step towards controllable character animation in the wild. Specifically, we introduce RealisDance-DiT, built upon the Wan-2.1 video foundation model. Our in-depth analysis reveals that the widely adopted Reference Net design is suboptimal for large-scale DiT models. Instead, we demonstrate that minimal modifications to the foundation model architecture yield a surprisingly strong baseline. We further propose the low-noise warmup and "large batches and small iterations" strategies to accelerate model convergence during fine-tuning while maximally preserving the priors of the foundation model. In addition, we introduce a new test dataset that captures diverse real-world challenges, complementing existing benchmarks such as the TikTok dataset and the UBC fashion video dataset, to comprehensively evaluate the proposed method. Extensive experiments show that RealisDance-DiT outperforms existing methods by a large margin.

