STIV: Scalable Text and Image Conditioned Video Generation
December 10, 2024
Authors: Zongyu Lin, Wei Liu, Chen Chen, Jiasen Lu, Wenze Hu, Tsu-Jui Fu, Jesse Allardice, Zhengfeng Lai, Liangchen Song, Bowen Zhang, Cha Chen, Yiran Fei, Yifan Jiang, Lezhi Li, Yizhou Sun, Kai-Wei Chang, Yinfei Yang
cs.AI
Abstract
The field of video generation has made remarkable advancements, yet there
remains a pressing need for a clear, systematic recipe that can guide the
development of robust and scalable models. In this work, we present a
comprehensive study that systematically explores the interplay of model
architectures, training recipes, and data curation strategies, culminating in a
simple and scalable text-image-conditioned video generation method, named STIV.
Our framework integrates image conditioning into a Diffusion Transformer (DiT)
through frame replacement, while incorporating text conditioning via joint
image-text conditional classifier-free guidance. This design enables STIV to
perform both text-to-video (T2V) and text-image-to-video (TI2V) tasks
simultaneously. Additionally, STIV can be easily extended to various
applications, such as video prediction, frame interpolation, multi-view
generation, and long video generation. With comprehensive ablation studies
on T2I, T2V, and TI2V, STIV demonstrates strong performance despite its simple
design. An 8.7B model with 512 resolution achieves 83.1 on VBench T2V,
surpassing both leading open and closed-source models like CogVideoX-5B, Pika,
Kling, and Gen-3. The same-sized model also achieves a state-of-the-art result
of 90.1 on the VBench I2V task at 512 resolution. By providing a transparent and
extensible recipe for building cutting-edge video generation models, we aim to
empower future research and accelerate progress toward more versatile and
reliable video generation solutions.
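
For intuition, the frame-replacement conditioning and joint image-text classifier-free guidance described above can be sketched as follows. This is a minimal, hypothetical sketch rather than the authors' implementation: the `dit` call signature, the (batch, frames, channels, height, width) latent layout, the `null_text_emb` placeholder, the guidance scale value, and the choice to drop the image condition by simply skipping frame replacement are all assumptions made for illustration.

```python
import torch

def frame_replacement(noisy_latents: torch.Tensor,
                      image_latent: torch.Tensor) -> torch.Tensor:
    """Image conditioning by frame replacement: overwrite the first noisy
    latent frame with the clean latent of the conditioning image.
    noisy_latents: (batch, frames, channels, height, width)  # assumed layout
    image_latent:  (batch, channels, height, width)"""
    latents = noisy_latents.clone()
    latents[:, 0] = image_latent
    return latents

@torch.no_grad()
def joint_cfg_step(dit, noisy_latents, timestep, text_emb, image_latent,
                   null_text_emb, guidance_scale: float = 7.5):
    """One denoising step with joint image-text classifier-free guidance:
    the conditional pass sees both the replaced frame and the text embedding,
    the unconditional pass drops both conditions at once, and the output is
    extrapolated from the unconditional toward the conditional prediction."""
    cond = dit(frame_replacement(noisy_latents, image_latent),
               timestep, text_emb)
    # Both conditions dropped jointly; the paper may realize the null image
    # condition differently (e.g., a learned null frame).
    uncond = dit(noisy_latents, timestep, null_text_emb)
    return uncond + guidance_scale * (cond - uncond)
```

Because the image condition enters only through the replaced frame, the same network can be trained and sampled with or without it, which is what lets a single model serve both T2V and TI2V.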