Any2Caption:Interpreting Any Condition to Caption for Controllable Video Generation
March 31, 2025
Authors: Shengqiong Wu, Weicai Ye, Jiahao Wang, Quande Liu, Xintao Wang, Pengfei Wan, Di Zhang, Kun Gai, Shuicheng Yan, Hao Fei, Tat-Seng Chua
cs.AI
Abstract
To address the bottleneck of accurate user intent interpretation within the
current video generation community, we present Any2Caption, a novel framework
for controllable video generation under any condition. The key idea is to
decouple the various condition interpretation steps from the video synthesis
step. By leveraging modern multimodal large language models (MLLMs),
Any2Caption interprets diverse inputs--text, images, videos, and specialized
cues such as regions, motion, and camera poses--into dense, structured
captions that provide backbone video generators with better guidance. We also
introduce Any2CapIns, a large-scale dataset with 337K instances and 407K
conditions for any-condition-to-caption instruction tuning. Comprehensive
evaluations demonstrate that our system significantly improves the
controllability and video quality of existing video generation models across
various aspects. Project Page: https://sqwu.top/Any2Cap/
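To make the decoupling concrete, below is a minimal Python sketch of the two-stage pipeline the abstract describes: a Stage-1 MLLM turns heterogeneous conditions into one dense structured caption, and a Stage-2 text-conditioned video backbone consumes only that caption. Every name here (Conditions, StubMLLM, backbone.sample, the caption schema) is an illustrative assumption, not the paper's actual API.

```python
from dataclasses import dataclass, field
from typing import Any, List, Optional

# Hypothetical sketch of the decoupled two-stage pipeline: condition
# interpretation (Stage 1) is separated from video synthesis (Stage 2).

@dataclass
class Conditions:
    """A bundle of heterogeneous user-provided conditions."""
    text: str
    images: List[Any] = field(default_factory=list)    # reference images
    video: Optional[Any] = None                        # reference video clip
    regions: List[Any] = field(default_factory=list)   # region cues, e.g. boxes
    motion: Optional[Any] = None                       # motion trajectory cue
    camera_poses: Optional[Any] = None                 # camera pose sequence

def interpret_conditions(mllm: Any, cond: Conditions) -> str:
    """Stage 1: an MLLM reads all conditions and emits a single dense,
    structured caption (the field list below is an assumed schema)."""
    instruction = (
        "Write a structured caption covering subjects, scene, actions, "
        f"camera movement, and style.\nUser text: {cond.text}"
    )
    return mllm.generate(
        instruction,
        images=cond.images,
        video=cond.video,
        regions=cond.regions,
        motion=cond.motion,
        camera_poses=cond.camera_poses,
    )

def generate_video(backbone: Any, caption: str) -> Any:
    """Stage 2: any text-conditioned video generator consumes the caption
    alone, so it needs no condition-specific wiring."""
    return backbone.sample(prompt=caption)

class StubMLLM:
    """Placeholder standing in for a real multimodal LLM."""
    def generate(self, instruction: str, **cues: Any) -> str:
        present = [name for name, value in cues.items() if value]
        return f"[caption fused from {present or ['text']}] {instruction}"

class StubBackbone:
    """Placeholder standing in for a real video diffusion backbone."""
    def sample(self, prompt: str) -> str:
        return f"<video generated from: {prompt[:60]}...>"

if __name__ == "__main__":
    cond = Conditions(text="a red kite over a beach", regions=[(10, 10, 64, 64)])
    caption = interpret_conditions(StubMLLM(), cond)
    print(generate_video(StubBackbone(), caption))
```

Because Stage 2 sees only text, any off-the-shelf caption-driven generator can serve as the backbone without condition-specific retraining, which is the practical payoff of the decoupling.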