AV-Link: Temporally-Aligned Diffusion Features for Cross-Modal Audio-Video Generation

Abstract

We propose AV-Link, a unified framework for Video-to-Audio and Audio-to-Video generation that leverages the activations of frozen video and audio diffusion models for temporally-aligned cross-modal conditioning. The key to our framework is a Fusion Block that enables bidirectional information exchange between our backbone video and audio diffusion models through a temporally-aligned self-attention operation. Unlike prior work that uses feature extractors pretrained for other tasks for the conditioning signal, AV-Link can directly leverage features obtained from the complementary modality in a single framework, i.e., video features to generate audio, or audio features to generate video. We extensively evaluate our design choices and demonstrate the ability of our method to generate synchronized, high-quality audiovisual content, showcasing its potential for applications in immersive media generation. Project Page: snap-research.github.io/AVLink/
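The abstract does not spell out how the temporally-aligned self-attention operation is computed, so the following is only a minimal sketch of one way such an operation could look, not the authors' implementation. It assumes that video and audio tokens each carry a timestamp in seconds, that the two token sets are concatenated into one sequence, and that temporal alignment comes from a rotary-style positional encoding driven by those shared timestamps. The function names, the identity query/key/value projections, and the frequency schedule are all hypothetical choices made for brevity.

```python
import math

def rotate_by_time(vec, t, base_freq=1.0):
    """Rotary-style encoding (assumed): rotate each (even, odd) pair of
    features by an angle proportional to the timestamp t. len(vec) must be even."""
    out = []
    for i in range(0, len(vec), 2):
        angle = t * base_freq / (10.0 ** (i / len(vec)))
        c, s = math.cos(angle), math.sin(angle)
        x, y = vec[i], vec[i + 1]
        out += [x * c - y * s, x * s + y * c]
    return out

def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def joint_self_attention(video, audio):
    """video / audio: lists of (timestamp_seconds, feature_vector) tokens.
    Runs self-attention over the concatenated sequence, so every token can
    attend to tokens of both modalities."""
    tokens = video + audio
    dim = len(tokens[0][1])
    # Queries and keys: time-rotated features (identity projections, an assumption).
    qk = [rotate_by_time(feat, t) for t, feat in tokens]
    weights, outputs = [], []
    for q in qk:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(dim) for k in qk]
        w = softmax(scores)
        weights.append(w)
        # Values: the raw (unrotated) features.
        outputs.append([sum(wj * tok[1][d] for wj, tok in zip(w, tokens))
                        for d in range(dim)])
    return weights, outputs

# Toy data: 2 video frames (1 fps) and 3 audio tokens (2 per second), identical features.
video = [(0.0, [1.0, 0.0]), (1.0, [1.0, 0.0])]
audio = [(0.0, [1.0, 0.0]), (0.5, [1.0, 0.0]), (1.0, [1.0, 0.0])]
weights, outputs = joint_self_attention(video, audio)
```

In this toy setup the features are identical, so the attention weights depend only on timestamps: the first video frame attends most strongly to tokens at the same time and less to tokens further away, regardless of modality. A real Fusion Block would additionally use learned projections and operate on activations of the frozen backbone diffusion models.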
