Multimodal Music Generation with Explicit Bridges and Retrieval Augmentation

Abstract

Multimodal music generation aims to produce music from diverse input modalities, including text, videos, and images. Existing methods rely on a common embedding space for multimodal fusion. Although this approach is effective in other modalities, applying it to multimodal music generation runs into data scarcity, weak cross-modal alignment, and limited controllability. This paper addresses these issues by using explicit bridges of text and music for multimodal alignment. We introduce a novel method named Visuals Music Bridge (VMB). Specifically, a Multimodal Music Description Model converts visual inputs into detailed textual descriptions to provide the text bridge, and a Dual-track Music Retrieval module combines broad and targeted retrieval strategies to provide the music bridge and enable user control. Finally, we design an Explicitly Conditioned Music Generation framework to generate music based on the two bridges. We conduct experiments on video-to-music, image-to-music, text-to-music, and controllable music generation tasks. The results demonstrate that VMB significantly enhances music quality, modality alignment, and customization alignment compared to previous methods. VMB sets a new standard for interpretable and expressive multimodal music generation with applications in various multimedia fields. Demos and code are available at https://github.com/wbs2788/VMB.
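The two-bridge idea in the abstract can be illustrated with a small toy sketch. This is not the VMB implementation: the abstract does not say how the broad and targeted retrieval tracks are defined, so the sketch assumes that broad retrieval ranks library tracks by embedding similarity to the text description, and that targeted retrieval matches a single user-chosen attribute (tempo here). All names (`Track`, `broad_retrieval`, `targeted_retrieval`, `build_conditions`) and the two-dimensional embeddings are hypothetical.

```python
import math
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Track:
    """A toy music-library entry (hypothetical structure, not from VMB)."""
    name: str
    embedding: List[float]  # assumed text-aligned embedding
    tempo_bpm: int          # assumed user-controllable attribute


def cosine(a: List[float], b: List[float]) -> float:
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def broad_retrieval(query: List[float], library: List[Track], k: int = 1) -> List[Track]:
    """Assumed 'broad' track: rank by overall similarity to the description."""
    ranked = sorted(library, key=lambda t: cosine(query, t.embedding), reverse=True)
    return ranked[:k]


def targeted_retrieval(target_bpm: int, library: List[Track], k: int = 1) -> List[Track]:
    """Assumed 'targeted' track: rank by closeness to a user-chosen tempo."""
    ranked = sorted(library, key=lambda t: abs(t.tempo_bpm - target_bpm))
    return ranked[:k]


def build_conditions(description: str, query: List[float],
                     library: List[Track], target_bpm: int) -> Dict[str, object]:
    """Bundle the text bridge and the music bridge as explicit conditions
    that a downstream generator (not modeled here) would consume."""
    return {
        "text_bridge": description,
        "music_bridge": {
            "broad": [t.name for t in broad_retrieval(query, library)],
            "targeted": [t.name for t in targeted_retrieval(target_bpm, library)],
        },
    }
```

In this sketch the description string stands in for the output of the Multimodal Music Description Model, and the returned dictionary stands in for the explicit conditions passed to the generator. Keeping the two retrieval tracks separate is what would let a user change one attribute, such as tempo, without altering the overall match to the description.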
