Temporally Aligned Audio for Video with Autoregression
September 20, 2024
Authors: Ilpo Viertola, Vladimir Iashin, Esa Rahtu
cs.AI
Abstract
We introduce V-AURA, the first autoregressive model to achieve high temporal
alignment and relevance in video-to-audio generation. V-AURA uses a
high-framerate visual feature extractor and a cross-modal audio-visual feature
fusion strategy to capture fine-grained visual motion events and ensure precise
temporal alignment. Additionally, we propose VisualSound, a benchmark dataset
with high audio-visual relevance. VisualSound is based on VGGSound, a video
dataset consisting of in-the-wild samples extracted from YouTube. During
curation, we remove samples where auditory events are not aligned with the
visual ones. V-AURA outperforms current state-of-the-art models in temporal
alignment and semantic relevance while maintaining comparable audio quality.
Code, samples, VisualSound and models are available at
https://v-aura.notion.site
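
To make the fusion idea concrete, here is a minimal PyTorch sketch of how high-framerate visual features might be resampled to the audio token rate and fused with audio-token embeddings before an autoregressive decoder. The module, its dimensions, and the nearest-frame alignment are assumptions for illustration, not the authors' implementation.

```python
import torch
import torch.nn as nn


class CrossModalFusion(nn.Module):
    """Hypothetical sketch: project frame-level visual features and
    audio-token embeddings into a shared space, align the visual stream
    to the audio token rate, and sum the streams before an autoregressive
    transformer decoder. All dimensions are assumptions."""

    def __init__(self, d_audio: int = 1024, d_visual: int = 768, d_model: int = 1024):
        super().__init__()
        self.vis_proj = nn.Linear(d_visual, d_model)
        self.aud_proj = nn.Linear(d_audio, d_model)

    def forward(self, audio_emb: torch.Tensor, visual_feats: torch.Tensor) -> torch.Tensor:
        # audio_emb:    (B, T_a, d_audio) embeddings of already-generated audio tokens
        # visual_feats: (B, T_v, d_visual) high-framerate per-frame visual features
        B, T_a, _ = audio_emb.shape
        # Nearest-frame alignment of the visual stream to the audio token rate
        # (an assumption; any resampling scheme could be substituted).
        idx = torch.linspace(0, visual_feats.shape[1] - 1, T_a).round().long()
        v = self.vis_proj(visual_feats[:, idx])   # (B, T_a, d_model)
        a = self.aud_proj(audio_emb)              # (B, T_a, d_model)
        return a + v  # fused sequence for next-token prediction


# Usage with toy shapes: 10 s of video at 25 fps, 500 audio tokens.
fusion = CrossModalFusion()
audio = torch.randn(2, 500, 1024)
video = torch.randn(2, 250, 768)
print(fusion(audio, video).shape)  # torch.Size([2, 500, 1024])
```

Summing the two projected streams per timestep is one simple choice; concatenation or cross-attention would fit the same interface.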
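The VisualSound curation step can likewise be pictured as threshold-based filtering on a per-sample audio-visual alignment score. The scorer and threshold below are placeholders, since the abstract does not specify how alignment is measured.

```python
from typing import Callable, Iterable, List


def curate(samples: Iterable[str],
           av_alignment_score: Callable[[str], float],
           threshold: float = 0.5) -> List[str]:
    """Keep only clips whose auditory events are judged to align with the
    visual events. `av_alignment_score` is a hypothetical scorer, e.g. an
    off-the-shelf audio-visual synchronization model; the threshold is an
    assumed hyperparameter."""
    return [clip for clip in samples if av_alignment_score(clip) >= threshold]
```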