Multimodal Music Generation with Explicit Bridges and Retrieval Augmentation

Abstract

Multimodal music generation aims to produce music from diverse input modalities, including text, videos, and images. Existing methods use a common embedding space for multimodal fusion. Although this approach is effective in other modalities, its application to multimodal music generation faces three challenges: data scarcity, weak cross-modal alignment, and limited controllability. This paper addresses these issues by using explicit bridges of text and music for multimodal alignment. We introduce a novel method named Visuals Music Bridge (VMB). Specifically, a Multimodal Music Description Model converts visual inputs into detailed textual descriptions to provide the text bridge, and a Dual-track Music Retrieval module combines broad and targeted retrieval strategies to provide the music bridge and enable user control. Finally, we design an Explicitly Conditioned Music Generation framework to generate music based on the two bridges. We conduct experiments on video-to-music, image-to-music, text-to-music, and controllable music generation tasks. The results demonstrate that VMB significantly enhances music quality, as well as modality and customization alignment, compared to previous methods. VMB sets a new standard for interpretable and expressive multimodal music generation, with applications in various multimedia fields. Demos and code are available at https://github.com/wbs2788/VMB.
