MuVi: Video-to-Music Generation with Semantic Alignment and Rhythmic Synchronization

Abstract

Generating music that aligns with the visual content of a video remains a challenging task: it requires a deep understanding of visual semantics and must produce music whose melody, rhythm, and dynamics harmonize with the visual narrative. This paper presents MuVi, a novel framework that addresses these challenges to enhance the cohesion and immersiveness of audio-visual content. MuVi analyzes video content through a specially designed visual adaptor to extract contextually and temporally relevant features. These features are used to generate music that matches not only the video's mood and theme but also its rhythm and pacing. We also introduce a contrastive music-visual pre-training scheme that ensures synchronization, based on the periodic nature of musical phrases. In addition, we demonstrate that our flow-matching-based music generator has in-context learning ability, allowing us to control the style and genre of the generated music. Experimental results show that MuVi achieves superior performance in both audio quality and temporal synchronization. The generated music video samples are available at https://muvi-v2m.github.io.
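
The abstract does not spell out the contrastive objective, so the following is only a rough illustration of contrastive pre-training in general, not MuVi's actual scheme (which, per the abstract, additionally relies on the periodicity of music phrases). The sketch computes a generic symmetric InfoNCE loss over a batch of paired video and music embeddings. The function names, the temperature of 0.07, and the random toy embeddings are all assumptions made for illustration.

```python
import math
import random


def l2_normalize(vec):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def info_nce(anchors, candidates, temperature=0.07):
    """One-directional InfoNCE: for anchor i, candidate i is the positive
    and every other candidate in the batch is a negative."""
    a = [l2_normalize(v) for v in anchors]
    c = [l2_normalize(v) for v in candidates]
    total = 0.0
    for i, anchor in enumerate(a):
        logits = [sum(x * y for x, y in zip(anchor, cand)) / temperature
                  for cand in c]
        # Numerically stable log-sum-exp over the row of logits.
        peak = max(logits)
        log_denominator = peak + math.log(sum(math.exp(l - peak) for l in logits))
        total += log_denominator - logits[i]  # cross-entropy for the positive pair
    return total / len(a)


def symmetric_contrastive_loss(video_emb, music_emb, temperature=0.07):
    """Average of the video-to-music and music-to-video InfoNCE terms."""
    return 0.5 * (info_nce(video_emb, music_emb, temperature)
                  + info_nce(music_emb, video_emb, temperature))


# Toy data (assumption): 8 clips with 16-dimensional embeddings.
random.seed(0)
video = [[random.gauss(0.0, 1.0) for _ in range(16)] for _ in range(8)]

# Perfectly aligned pairs versus pairs shifted by one position in the batch.
aligned = symmetric_contrastive_loss(video, video)
shuffled = symmetric_contrastive_loss(video, video[1:] + video[:1])
```

In this toy setup the aligned pairs yield a lower loss than the shifted pairs, which is the behavior a contrastive objective rewards: embeddings of matching video and music segments are pulled together, while mismatched ones are pushed apart.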
