AV-Link: Temporally-Aligned Diffusion Features for Cross-Modal Audio-Video Generation
December 19, 2024
Authors: Moayed Haji-Ali, Willi Menapace, Aliaksandr Siarohin, Ivan Skorokhodov, Alper Canberk, Kwot Sin Lee, Vicente Ordonez, Sergey Tulyakov
cs.AI
Abstract
We propose AV-Link, a unified framework for Video-to-Audio and Audio-to-Video
generation that leverages the activations of frozen video and audio diffusion
models for temporally-aligned cross-modal conditioning. The key to our
framework is a Fusion Block that enables bidirectional information exchange
between our backbone video and audio diffusion models through a
temporally-aligned self-attention operation. Unlike prior work that uses
feature extractors pretrained for other tasks as the conditioning signal,
AV-Link can directly leverage features obtained from the complementary modality
in a single framework, i.e., video features to generate audio, or audio
features to generate video. We extensively evaluate our design choices and
demonstrate the ability of our method to achieve synchronized and high-quality
audiovisual content, showcasing its potential for applications in immersive
media generation. Project Page: snap-research.github.io/AVLink/
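
To make the Fusion Block idea concrete, here is a minimal sketch in PyTorch of one way such a block could work: features from each frozen backbone are projected into a shared space, resampled onto a common temporal grid, and mixed with a single joint self-attention over the concatenated sequences. All identifiers, dimensions, and the interpolation-based alignment are illustrative assumptions, not the paper's actual implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FusionBlock(nn.Module):
    """Hypothetical fusion block: bidirectional exchange between video and
    audio diffusion features via temporally-aligned joint self-attention."""

    def __init__(self, video_dim: int, audio_dim: int, dim: int = 512, heads: int = 8):
        super().__init__()
        # Project each modality's (frozen) activations into a shared space.
        self.video_in = nn.Linear(video_dim, dim)
        self.audio_in = nn.Linear(audio_dim, dim)
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)
        # Project fused features back to each backbone's dimension.
        self.video_out = nn.Linear(dim, video_dim)
        self.audio_out = nn.Linear(dim, audio_dim)

    def forward(self, video_feats: torch.Tensor, audio_feats: torch.Tensor):
        # video_feats: (B, Tv, video_dim); audio_feats: (B, Ta, audio_dim).
        v = self.video_in(video_feats)
        a = self.audio_in(audio_feats)
        # Resample both token sequences onto a common temporal grid so that
        # attention relates tokens corresponding to the same moment in time.
        t = max(v.shape[1], a.shape[1])
        v = F.interpolate(v.transpose(1, 2), size=t, mode="linear").transpose(1, 2)
        a = F.interpolate(a.transpose(1, 2), size=t, mode="linear").transpose(1, 2)
        # One joint self-attention over the concatenated sequences yields
        # bidirectional information exchange in a single operation.
        x = self.norm(torch.cat([v, a], dim=1))
        fused, _ = self.attn(x, x, x)
        fv, fa = fused[:, :t], fused[:, t:]
        # Residual signals to inject back into each diffusion backbone.
        return self.video_out(fv), self.audio_out(fa)
```

Because both modalities attend over one concatenated, time-aligned sequence, the same block serves Video-to-Audio and Audio-to-Video conditioning, matching the abstract's description of a single framework with bidirectional exchange.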