Motion Control for Enhanced Complex Action Video Generation

Abstract

Existing text-to-video (T2V) models often struggle to generate videos with sufficiently pronounced or complex actions. A key limitation is that text prompts cannot precisely convey intricate motion details. To address this, we propose MVideo, a novel framework designed to produce long-duration videos with precise, fluid actions. MVideo overcomes the limitations of text prompts by taking mask sequences as an additional motion condition input, which gives a clearer, more accurate representation of the intended actions. MVideo generates these mask sequences automatically using foundational vision models such as GroundingDINO and SAM2, enhancing both efficiency and robustness. Our results demonstrate that, after training, MVideo effectively aligns text prompts with motion conditions to produce videos that satisfy both. This dual control mechanism allows more dynamic video generation: the text prompt and the motion condition can be altered independently or in tandem. Furthermore, MVideo supports motion condition editing and composition, facilitating the generation of videos with more complex actions. MVideo thus advances T2V motion generation, setting a strong benchmark for improved action depiction in current video diffusion models. Our project page is available at https://mvideo-v1.github.io/.
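
One way to picture motion condition editing and composition is as array operations on binary mask sequences. The sketch below is illustrative only and is not taken from MVideo: it assumes each motion condition is a boolean NumPy array of shape (T, H, W), and the helper names (compose_masks, shift_masks, moving_square) are hypothetical. It does not cover mask extraction with GroundingDINO or SAM2, nor how a video model would consume the masks.

```python
import numpy as np


def compose_masks(seq_a, seq_b):
    """Compose two boolean mask sequences of shape (T, H, W) by per-pixel union."""
    if seq_a.shape != seq_b.shape:
        raise ValueError("mask sequences must share the same (T, H, W) shape")
    return np.logical_or(seq_a, seq_b)


def shift_masks(seq, dy, dx):
    """Edit a mask sequence by translating every frame by (dy, dx) pixels.

    Pixels shifted outside the frame are dropped; vacated pixels become False.
    """
    _, h, w = seq.shape
    out = np.zeros_like(seq)
    if abs(dy) >= h or abs(dx) >= w:
        return out  # everything was shifted out of view
    src_y = slice(max(0, -dy), min(h, h - dy))
    dst_y = slice(max(0, dy), min(h, h + dy))
    src_x = slice(max(0, -dx), min(w, w - dx))
    dst_x = slice(max(0, dx), min(w, w + dx))
    out[:, dst_y, dst_x] = seq[:, src_y, src_x]
    return out


def moving_square(t, h, w, top, left, size, step_x):
    """Toy motion condition (assumed for illustration): a square sliding right."""
    seq = np.zeros((t, h, w), dtype=bool)
    for i in range(t):
        x = left + i * step_x
        seq[i, top:top + size, x:x + size] = True
    return seq


# A square moving left to right near the top of the frame.
a = moving_square(4, 16, 16, top=2, left=0, size=3, step_x=2)
# Edit: reverse the motion in time and move it 8 pixels lower.
b = shift_masks(a[::-1], dy=8, dx=0)
# Compose: both motions in a single mask sequence.
combined = compose_masks(a, b)
```

Here the time reversal and the translation stand in for "editing", and the union stands in for "composition"; a real pipeline might use different operations.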
