^RFLAV: Rolling Flow matching for infinite Audio Video generation
March 11, 2025
Authors: Alex Ergasti, Giuseppe Gabriele Tarollo, Filippo Botti, Tomaso Fontanini, Claudio Ferrari, Massimo Bertozzi, Andrea Prati
cs.AI
Abstract
Joint audio-video (AV) generation is still a significant challenge in
generative AI, primarily due to three critical requirements: quality of the
generated samples, seamless multimodal synchronization and temporal coherence,
with audio tracks that match the visual data and vice versa, and limitless
video duration. In this paper, we present ^RFLAV, a novel transformer-based
architecture that addresses all the key challenges of AV generation. We explore
three distinct cross-modality interaction modules, with our lightweight
temporal fusion module emerging as the most effective and computationally
efficient approach for aligning audio and visual modalities. Our experimental
results demonstrate that ^RFLAV outperforms existing state-of-the-art models
in multimodal AV generation tasks. Our code and checkpoints are available at
https://github.com/ErgastiAlex/R-FLAV.