STIV: Scalable Text and Image Conditioned Video Generation
December 10, 2024
Authors: Zongyu Lin, Wei Liu, Chen Chen, Jiasen Lu, Wenze Hu, Tsu-Jui Fu, Jesse Allardice, Zhengfeng Lai, Liangchen Song, Bowen Zhang, Cha Chen, Yiran Fei, Yifan Jiang, Lezhi Li, Yizhou Sun, Kai-Wei Chang, Yinfei Yang
cs.AI
Abstract
The field of video generation has made remarkable advancements, yet there
remains a pressing need for a clear, systematic recipe that can guide the
development of robust and scalable models. In this work, we present a
comprehensive study that systematically explores the interplay of model
architectures, training recipes, and data curation strategies, culminating in a
simple and scalable text-image-conditioned video generation method, named STIV.
Our framework integrates an image condition into a Diffusion Transformer (DiT)
through frame replacement, while incorporating text conditioning via joint
image-text conditional classifier-free guidance. This design enables STIV to
perform both text-to-video (T2V) and text-image-to-video (TI2V) tasks
simultaneously. Additionally, STIV can be easily extended to various
applications, such as video prediction, frame interpolation, multi-view
generation, and long video generation. With comprehensive ablation studies
on T2I, T2V, and TI2V, STIV demonstrates strong performance despite its simple
design. An 8.7B-parameter model at 512 resolution achieves 83.1 on VBench T2V,
surpassing leading open- and closed-source models such as CogVideoX-5B, Pika,
Kling, and Gen-3. The same-sized model also achieves a state-of-the-art result
of 90.1 on the VBench I2V task at 512 resolution. By providing a transparent and
extensible recipe for building cutting-edge video generation models, we aim to
empower future research and accelerate progress toward more versatile and
reliable video generation solutions.
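The two conditioning mechanisms named in the abstract can be sketched in a few lines. Below is a minimal, hypothetical NumPy illustration (not the paper's implementation): `frame_replacement` injects the image condition by substituting the first noisy latent frame with the clean image latent, and `joint_cfg` applies a single classifier-free guidance term in which the text and image conditions are dropped together for the unconditional branch. Function names, shapes, and the exact guidance form are assumptions for illustration only.

```python
import numpy as np

def frame_replacement(noisy_latents, image_latent):
    """Image conditioning via frame replacement (illustrative):
    overwrite the first latent frame with the clean image latent.
    noisy_latents: (T, C, H, W); image_latent: (C, H, W)."""
    out = noisy_latents.copy()
    out[0] = image_latent  # the model sees a noise-free conditioning frame
    return out

def joint_cfg(eps_cond, eps_uncond, scale):
    """Joint image-text classifier-free guidance (assumed single-term form):
    eps_cond   -- noise prediction with BOTH text and image conditions,
    eps_uncond -- noise prediction with both conditions dropped,
    scale      -- guidance strength."""
    return eps_uncond + scale * (eps_cond - eps_uncond)
```

With `scale = 1.0` the guided prediction reduces to the fully conditioned one; larger scales extrapolate away from the unconditional branch, trading diversity for condition adherence. Using one joint guidance term (rather than separate text and image terms) keeps inference at two forward passes per step.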