AV-Link: Temporally-Aligned Diffusion Features for Cross-Modal Audio-Video Generation
December 19, 2024
Authors: Moayed Haji-Ali, Willi Menapace, Aliaksandr Siarohin, Ivan Skorokhodov, Alper Canberk, Kwot Sin Lee, Vicente Ordonez, Sergey Tulyakov
cs.AI
Abstract
We propose AV-Link, a unified framework for Video-to-Audio and Audio-to-Video
generation that leverages the activations of frozen video and audio diffusion
models for temporally-aligned cross-modal conditioning. The key to our
framework is a Fusion Block that enables bidirectional information exchange
between our backbone video and audio diffusion models through a
temporally-aligned self-attention operation. Unlike prior work that uses
feature extractors pretrained for other tasks as the conditioning signal,
AV-Link directly leverages features obtained from the complementary modality
within a single framework, i.e., video features to generate audio, or audio
features to generate video. We extensively evaluate our design choices and
demonstrate the ability of our method to produce synchronized, high-quality
audiovisual content, showcasing its potential for applications in immersive
media generation. Project Page: snap-research.github.io/AVLink/