MuVi: Video-to-Music Generation with Semantic Alignment and Rhythmic Synchronization
October 16, 2024
Authors: Ruiqi Li, Siqi Zheng, Xize Cheng, Ziang Zhang, Shengpeng Ji, Zhou Zhao
cs.AI
Abstract
Generating music that aligns with the visual content of a video has been a
challenging task, as it requires a deep understanding of visual semantics and
involves generating music whose melody, rhythm, and dynamics harmonize with the
visual narratives. This paper presents MuVi, a novel framework that effectively
addresses these challenges to enhance the cohesion and immersive experience of
audio-visual content. MuVi analyzes video content through a specially designed
visual adaptor to extract contextually and temporally relevant features. These
features are used to generate music that matches not only the video's mood and
theme but also its rhythm and pacing. We also introduce a contrastive
music-visual pre-training scheme, based on the periodic nature of music
phrases, to ensure synchronization. In addition, we demonstrate that our
flow-matching-based music generator has in-context learning ability, allowing
us to control the style and genre of the generated music. Experimental results
show that MuVi demonstrates superior performance in both audio quality and
temporal synchronization. The generated music video samples are available at
https://muvi-v2m.github.io.