AV-Link: Temporally-Aligned Diffusion Features for Cross-Modal Audio-Video Generation
December 19, 2024
Authors: Moayed Haji-Ali, Willi Menapace, Aliaksandr Siarohin, Ivan Skorokhodov, Alper Canberk, Kwot Sin Lee, Vicente Ordonez, Sergey Tulyakov
cs.AI
Abstract
We propose AV-Link, a unified framework for Video-to-Audio and Audio-to-Video
generation that leverages the activations of frozen video and audio diffusion
models for temporally-aligned cross-modal conditioning. The key to our
framework is a Fusion Block that enables bidirectional information exchange
between our backbone video and audio diffusion models through a
temporally-aligned self-attention operation. Unlike prior work that uses
feature extractors pretrained for other tasks as the conditioning signal,
AV-Link directly leverages features obtained from the complementary modality
within a single framework, i.e., video features to generate audio, or audio
features to generate video. We extensively evaluate our design choices and
demonstrate the ability of our method to produce synchronized, high-quality
audiovisual content, showcasing its potential for applications in immersive
media generation. Project Page: snap-research.github.io/AVLink/