JavisDiT: Joint Audio-Video Diffusion Transformer with Hierarchical Spatio-Temporal Prior Synchronization
March 30, 2025
Authors: Kai Liu, Wei Li, Lai Chen, Shengqiong Wu, Yanhao Zheng, Jiayi Ji, Fan Zhou, Rongxin Jiang, Jiebo Luo, Hao Fei, Tat-Seng Chua
cs.AI
Abstract
This paper introduces JavisDiT, a novel Joint Audio-Video Diffusion
Transformer designed for synchronized audio-video generation (JAVG). Built upon
the powerful Diffusion Transformer (DiT) architecture, JavisDiT is able to
generate high-quality audio and video content simultaneously from open-ended
user prompts. To ensure optimal synchronization, we introduce a fine-grained
spatio-temporal alignment mechanism through a Hierarchical Spatial-Temporal
Synchronized Prior (HiST-Sypo) Estimator. This module extracts both global and
fine-grained spatio-temporal priors, guiding the synchronization between the
visual and auditory components. Furthermore, we propose a new benchmark,
JavisBench, consisting of 10,140 high-quality text-captioned sounding videos
spanning diverse scenes and complex real-world scenarios. We also devise a
robust metric for evaluating the synchronization between generated audio-video
pairs in complex real-world content. Experimental results
demonstrate that JavisDiT significantly outperforms existing methods by
ensuring both high-quality generation and precise synchronization, setting a
new standard for JAVG tasks. Our code, model, and dataset will be made publicly
available at https://javisdit.github.io/.