Motion Control for Enhanced Complex Action Video Generation
November 13, 2024
Authors: Qiang Zhou, Shaofeng Zhang, Nianzu Yang, Ye Qian, Hao Li
cs.AI
Abstract
Existing text-to-video (T2V) models often struggle with generating videos
with sufficiently pronounced or complex actions. A key limitation lies in the
text prompt's inability to precisely convey intricate motion details. To
address this, we propose a novel framework, MVideo, designed to produce
long-duration videos with precise, fluid actions. MVideo overcomes the
limitations of text prompts by incorporating mask sequences as an additional
motion condition input, providing a clearer, more accurate representation of
intended actions. Leveraging foundational vision models such as GroundingDINO
and SAM2, MVideo automatically generates mask sequences, enhancing both
efficiency and robustness. Our results demonstrate that, after training, MVideo
effectively aligns text prompts with motion conditions to produce videos that
simultaneously meet both criteria. This dual control mechanism allows for more
dynamic video generation by enabling alterations to either the text prompt or
motion condition independently, or both in tandem. Furthermore, MVideo supports
motion condition editing and composition, facilitating the generation of videos
with more complex actions. MVideo thus advances T2V motion generation, setting
a strong benchmark for improved action depiction in current video diffusion
models. Our project page is available at https://mvideo-v1.github.io/.
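
To make "motion condition editing and composition" concrete, here is a minimal sketch of one possible reading. It is not the authors' implementation. It assumes a mask sequence is a list of binary frames (one H x W grid of 0/1 values per time step), that an edit can be as simple as translating the mask, and that composition can be a per-pixel union. The names `shift_mask_sequence` and `compose_mask_sequences` are hypothetical and do not come from the paper.

```python
from typing import List

Frame = List[List[int]]      # H x W grid of 0/1 values (assumed representation)
MaskSequence = List[Frame]   # one frame per video time step


def shift_mask_sequence(masks: MaskSequence, dx: int) -> MaskSequence:
    """Hypothetical edit: move every mask dx pixels to the right (left if dx < 0).

    Pixels shifted past the frame border are dropped; nothing wraps around.
    """
    shifted = []
    for frame in masks:
        new_frame = []
        for row in frame:
            width = len(row)
            new_row = [0] * width
            for x, value in enumerate(row):
                nx = x + dx
                if value and 0 <= nx < width:
                    new_row[nx] = 1
            new_frame.append(new_row)
        shifted.append(new_frame)
    return shifted


def compose_mask_sequences(a: MaskSequence, b: MaskSequence) -> MaskSequence:
    """Hypothetical composition: per-pixel union of two equally long sequences."""
    if len(a) != len(b):
        raise ValueError("mask sequences must have the same number of frames")
    return [
        [[int(pa or pb) for pa, pb in zip(row_a, row_b)]
         for row_a, row_b in zip(frame_a, frame_b)]
        for frame_a, frame_b in zip(a, b)
    ]


# Two frames, each 2 x 4, with a single foreground pixel in the top-left corner.
original = [[[1, 0, 0, 0], [0, 0, 0, 0]] for _ in range(2)]
edited = shift_mask_sequence(original, 2)            # same motion, moved right
combined = compose_mask_sequences(original, edited)  # both motions in one condition
```

Under these assumptions, `combined` carries both foreground regions in every frame, so it could serve as a single motion condition describing two simultaneous actions.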