Any2Caption:Interpreting Any Condition to Caption for Controllable Video Generation
March 31, 2025
Authors: Shengqiong Wu, Weicai Ye, Jiahao Wang, Quande Liu, Xintao Wang, Pengfei Wan, Di Zhang, Kun Gai, Shuicheng Yan, Hao Fei, Tat-Seng Chua
cs.AI
Abstract
To address the bottleneck of accurate user intent interpretation within the
current video generation community, we present Any2Caption, a novel framework
for controllable video generation under any condition. The key idea is to
decouple the various condition interpretation steps from the video synthesis
step. By leveraging modern multimodal large language models (MLLMs),
Any2Caption interprets diverse inputs--text, images, videos, and specialized
cues such as regions, motion, and camera poses--into dense, structured
captions that provide backbone video generators with better guidance. We also
introduce Any2CapIns, a large-scale dataset with 337K instances and 407K
conditions for any-condition-to-caption instruction tuning. Comprehensive
evaluations demonstrate that our system significantly improves the
controllability and video quality of existing video generation models across
various aspects. Project Page: https://sqwu.top/Any2Cap/
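To make the decoupling concrete, below is a minimal Python sketch of the two-stage pipeline the abstract describes: a Stage-1 MLLM turns heterogeneous conditions into one dense structured caption, and a Stage-2 text-conditioned video backbone consumes only that caption. Every name here (Conditions, StubMLLM, backbone.sample, the caption schema) is an illustrative assumption, not the paper's actual API.

```python
from dataclasses import dataclass, field
from typing import Any, List, Optional

# Hypothetical sketch of the decoupled two-stage pipeline: condition
# interpretation (Stage 1) is separated from video synthesis (Stage 2).

@dataclass
class Conditions:
    """A bundle of heterogeneous user-provided conditions."""
    text: str
    images: List[Any] = field(default_factory=list)    # reference images
    video: Optional[Any] = None                        # reference video clip
    regions: List[Any] = field(default_factory=list)   # region cues, e.g. boxes
    motion: Optional[Any] = None                       # motion trajectory cue
    camera_poses: Optional[Any] = None                 # camera pose sequence

def interpret_conditions(mllm: Any, cond: Conditions) -> str:
    """Stage 1: an MLLM reads all conditions and emits a single dense,
    structured caption (the field list below is an assumed schema)."""
    instruction = (
        "Write a structured caption covering subjects, scene, actions, "
        f"camera movement, and style.\nUser text: {cond.text}"
    )
    return mllm.generate(
        instruction,
        images=cond.images,
        video=cond.video,
        regions=cond.regions,
        motion=cond.motion,
        camera_poses=cond.camera_poses,
    )

def generate_video(backbone: Any, caption: str) -> Any:
    """Stage 2: any text-conditioned video generator consumes the caption
    alone, so it needs no condition-specific wiring."""
    return backbone.sample(prompt=caption)

class StubMLLM:
    """Placeholder standing in for a real multimodal LLM."""
    def generate(self, instruction: str, **cues: Any) -> str:
        present = [name for name, value in cues.items() if value]
        return f"[caption fused from {present or ['text']}] {instruction}"

class StubBackbone:
    """Placeholder standing in for a real video diffusion backbone."""
    def sample(self, prompt: str) -> str:
        return f"<video generated from: {prompt[:60]}...>"

if __name__ == "__main__":
    cond = Conditions(text="a red kite over a beach", regions=[(10, 10, 64, 64)])
    caption = interpret_conditions(StubMLLM(), cond)
    print(generate_video(StubBackbone(), caption))
```

Because Stage 2 sees only text, any off-the-shelf caption-driven generator can serve as the backbone without condition-specific retraining, which is the practical payoff of the decoupling.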