^RFLAV: Rolling Flow matching for infinite Audio Video generation

Abstract

Joint audio-video (AV) generation remains a significant challenge in generative AI, primarily because of three critical requirements: high quality of the generated samples; seamless multimodal synchronization and temporal coherence, with audio tracks that match the visual data and vice versa; and unlimited video duration. In this paper, we present ^RFLAV, a novel transformer-based architecture that addresses all of these key challenges of AV generation. We explore three distinct cross-modality interaction modules, with our lightweight temporal fusion module emerging as the most effective and computationally efficient approach for aligning audio and visual modalities. Our experimental results demonstrate that ^RFLAV outperforms existing state-of-the-art models in multimodal AV generation tasks. Our code and checkpoints are available at https://github.com/ErgastiAlex/R-FLAV.
