VidCRAFT3: Camera, Object, and Lighting Control for Image-to-Video Generation
February 11, 2025
Authors: Sixiao Zheng, Zimian Peng, Yanpeng Zhou, Yi Zhu, Hang Xu, Xiangru Huang, Yanwei Fu
cs.AI
Abstract
Recent image-to-video generation methods have demonstrated success in
enabling control over one or two visual elements, such as camera trajectory or
object motion. However, these methods are unable to offer control over multiple
visual elements due to limitations in data and network efficacy. In this paper,
we introduce VidCRAFT3, a novel framework for precise image-to-video generation
that enables control over camera motion, object motion, and lighting direction
simultaneously. To better decouple control over each visual element, we propose
the Spatial Triple-Attention Transformer, which integrates lighting direction,
text, and image in a symmetric way. Since most real-world video datasets lack
lighting annotations, we construct a high-quality synthetic video dataset, the
VideoLightingDirection (VLD) dataset. This dataset includes lighting direction
annotations and objects of diverse appearance, enabling VidCRAFT3 to
effectively handle strong light transmission and reflection effects.
Additionally, we propose a three-stage training strategy that eliminates the
need for training data annotated with multiple visual elements (camera motion,
object motion, and lighting direction) simultaneously. Extensive experiments on
benchmark datasets demonstrate the efficacy of VidCRAFT3 in producing
high-quality video content, surpassing existing state-of-the-art methods in
terms of control granularity and visual coherence. All code and data will be
publicly available. Project page: https://sixiaozheng.github.io/VidCRAFT3/.
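
The abstract says only that the Spatial Triple-Attention Transformer integrates lighting direction, text, and image "in a symmetric way"; it gives no architectural detail. The sketch below is one possible reading, not the authors' implementation: three parallel cross-attention branches over the same queries, summed with equal weight. The function names, the equal-weight sum, and the omission of learned query/key/value projections are all assumptions made for illustration.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def cross_attention(queries, context):
    """Scaled dot-product attention; context tokens serve as keys and values.

    Assumption: no learned projections, to keep the sketch short.
    """
    scale = 1.0 / math.sqrt(len(queries[0]))
    outputs = []
    for q in queries:
        weights = softmax(
            [scale * sum(a * b for a, b in zip(q, k)) for k in context]
        )
        outputs.append(
            [sum(w * v[i] for w, v in zip(weights, context)) for i in range(len(q))]
        )
    return outputs

def triple_attention(queries, image_tokens, text_tokens, light_tokens):
    """Hypothetical symmetric fusion: one branch per condition, summed equally."""
    branches = [
        cross_attention(queries, ctx)
        for ctx in (image_tokens, text_tokens, light_tokens)
    ]
    # Add the three branch outputs element-wise for every query position.
    return [[sum(vals) for vals in zip(*rows)] for rows in zip(*branches)]

# Each condition contributes a single token, so each branch returns that token.
fused = triple_attention(
    [[1.0, 0.0]],      # spatial query features
    [[1.0, 2.0]],      # image tokens
    [[3.0, 4.0]],      # text tokens
    [[5.0, 6.0]],      # lighting-direction tokens
)
print(fused)
```

Because no branch is nested inside another, each condition is treated identically, which is one way to interpret "symmetric" here. A real model would add learned projections and possibly per-branch gating.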