Mimir: Improving Video Diffusion Models for Precise Text Understanding

Abstract

Text serves as the key control signal in video generation due to its narrative nature. To render text descriptions into video clips, current video diffusion models borrow features from text encoders yet struggle with limited text comprehension. The recent success of large language models (LLMs) showcases the power of decoder-only transformers, which offer three clear benefits for text-to-video (T2V) generation: precise text understanding resulting from superior scalability, imagination beyond the input text enabled by next-token prediction, and flexibility to prioritize user interests through instruction tuning. Nevertheless, the feature distribution gap that emerges from the two different text modeling paradigms hinders the direct use of LLMs in established T2V models. This work addresses this challenge with Mimir, an end-to-end training framework featuring a carefully tailored token fuser that harmonizes the outputs of text encoders and LLMs. This design allows the T2V model to fully leverage learned video priors while capitalizing on the text-related capabilities of LLMs. Extensive quantitative and qualitative results demonstrate the effectiveness of Mimir in generating high-quality videos with excellent text comprehension, especially when processing short captions and managing shifting motions. Project page: https://lucaria-academy.github.io/Mimir/
