SkyReels-A2: Compose Anything in Video Diffusion Transformers

Abstract

This paper presents SkyReels-A2, a controllable video generation framework that assembles arbitrary visual elements (e.g., characters, objects, backgrounds) into synthesized videos guided by textual prompts, while maintaining strict consistency with the reference image for each element. We term this task elements-to-video (E2V). Its primary challenges are preserving the fidelity of each reference element, ensuring coherent scene composition, and achieving natural-looking outputs. To address these challenges, we first design a comprehensive data pipeline that constructs prompt-reference-video triplets for model training. Next, we propose a novel image-text joint embedding model that injects multi-element representations into the generative process, balancing element-specific consistency with global coherence and text alignment. We also optimize the inference pipeline for both speed and output stability. Moreover, we introduce a carefully curated benchmark, A2 Bench, for systematic evaluation. Experiments demonstrate that our framework can generate diverse, high-quality videos with precise element control. SkyReels-A2 is the first open-source, commercial-grade model for E2V generation, performing favorably against advanced closed-source commercial models. We anticipate that SkyReels-A2 will advance creative applications such as drama and virtual e-commerce, pushing the boundaries of controllable video generation.
