Mimir: Improving Video Diffusion Models for Precise Text Understanding
December 4, 2024
Authors: Shuai Tan, Biao Gong, Yutong Feng, Kecheng Zheng, Dandan Zheng, Shuwei Shi, Yujun Shen, Jingdong Chen, Ming Yang
cs.AI
Abstract
Text serves as the key control signal in video generation due to its
narrative nature. To render text descriptions into video clips, current video
diffusion models borrow features from text encoders yet struggle with limited
text comprehension. The recent success of large language models (LLMs)
showcases the power of decoder-only transformers, which offer three clear
benefits for text-to-video (T2V) generation, namely, precise text understanding
resulting from superior scalability, imagination beyond the input text
enabled by next token prediction, and flexibility to prioritize user interests
through instruction tuning. Nevertheless, the feature distribution gap emerging
from the two different text modeling paradigms hinders the direct use of LLMs
in established T2V models. This work addresses this challenge with Mimir, an
end-to-end training framework featuring a carefully tailored token fuser to
harmonize the outputs from text encoders and LLMs. Such a design allows the T2V
model to fully leverage learned video priors while capitalizing on the
text-related capability of LLMs. Extensive quantitative and qualitative results
demonstrate the effectiveness of Mimir in generating high-quality videos with
excellent text comprehension, especially when processing short captions and
managing shifting motions. Project page:
https://lucaria-academy.github.io/Mimir/
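
The abstract does not spell out how the token fuser works internally, so the following is only a minimal sketch of one plausible way to harmonize two token streams with different feature distributions. It assumes that each stream is normalized per token, projected to a shared width, and then concatenated along the sequence axis. Every name here (`layer_norm`, `project`, `fuse_tokens`, the dimensions, and the random weights that stand in for learned ones) is hypothetical and is not taken from the Mimir implementation.

```python
import math
import random


def layer_norm(token, eps=1e-6):
    """Normalize one token vector to zero mean and unit variance."""
    mean = sum(token) / len(token)
    var = sum((x - mean) ** 2 for x in token) / len(token)
    return [(x - mean) / math.sqrt(var + eps) for x in token]


def project(token, weight):
    """Apply a linear map; `weight` is a list of rows (out_dim x in_dim)."""
    return [sum(w * x for w, x in zip(row, token)) for row in weight]


def random_matrix(out_dim, in_dim, rng):
    """Stand-in for a learned projection (assumption: random init only)."""
    scale = 1.0 / math.sqrt(in_dim)
    return [[rng.gauss(0.0, scale) for _ in range(in_dim)] for _ in range(out_dim)]


def fuse_tokens(encoder_tokens, llm_tokens, w_enc, w_llm):
    """Hypothetical fuser: normalize, project to a shared width, concatenate."""
    enc = [project(layer_norm(t), w_enc) for t in encoder_tokens]
    llm = [project(layer_norm(t), w_llm) for t in llm_tokens]
    # Concatenate along the sequence axis so the video model attends to both.
    return enc + llm


rng = random.Random(0)
ENC_DIM, LLM_DIM, SHARED_DIM = 8, 16, 12  # illustrative sizes only

encoder_tokens = [[rng.gauss(0.0, 1.0) for _ in range(ENC_DIM)] for _ in range(4)]
# LLM features are drawn at a much larger scale to mimic a distribution gap.
llm_tokens = [[rng.gauss(5.0, 20.0) for _ in range(LLM_DIM)] for _ in range(6)]

w_enc = random_matrix(SHARED_DIM, ENC_DIM, rng)
w_llm = random_matrix(SHARED_DIM, LLM_DIM, rng)

fused = fuse_tokens(encoder_tokens, llm_tokens, w_enc, w_llm)
print(len(fused), len(fused[0]))
```

Normalizing before projecting is the design choice that matters in this sketch: it puts both streams on a comparable scale before they are mixed, which is one simple way to address a feature distribution gap. The actual Mimir fuser may differ substantially.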