Taming Multimodal Joint Training for High-Quality Video-to-Audio Synthesis

December 19, 2024
Authors: Ho Kei Cheng, Masato Ishii, Akio Hayakawa, Takashi Shibuya, Alexander Schwing, Yuki Mitsufuji
cs.AI

Abstract

We propose to synthesize high-quality and synchronized audio, given video and optional text conditions, using a novel multimodal joint training framework MMAudio. In contrast to single-modality training conditioned on (limited) video data only, MMAudio is jointly trained with larger-scale, readily available text-audio data to learn to generate semantically aligned high-quality audio samples. Additionally, we improve audio-visual synchrony with a conditional synchronization module that aligns video conditions with audio latents at the frame level. Trained with a flow matching objective, MMAudio achieves new video-to-audio state-of-the-art among public models in terms of audio quality, semantic alignment, and audio-visual synchronization, while having a low inference time (1.23s to generate an 8s clip) and just 157M parameters. MMAudio also achieves surprisingly competitive performance in text-to-audio generation, showing that joint training does not hinder single-modality performance. Code and demo are available at: https://hkchengrex.github.io/MMAudio
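The abstract states that MMAudio is trained with a flow matching objective. Below is a minimal, self-contained sketch of conditional flow matching with a straight-line interpolation path, for readers unfamiliar with the objective. The `FlowNet` module, tensor shapes, and pooled conditioning vector are illustrative assumptions only; they do not reproduce MMAudio's actual architecture or its conditional synchronization module.

```python
import torch
import torch.nn as nn


# Hypothetical stand-in for the multimodal flow network; the real MMAudio model
# (transformer blocks, frame-level conditional synchronization) is not shown here.
class FlowNet(nn.Module):
    def __init__(self, latent_dim: int, cond_dim: int):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(latent_dim + cond_dim + 1, 256),
            nn.SiLU(),
            nn.Linear(256, latent_dim),
        )

    def forward(self, x_t, t, cond):
        # Predict the velocity field v(x_t, t | cond).
        return self.net(torch.cat([x_t, cond, t], dim=-1))


def flow_matching_loss(model, x1, cond):
    """Conditional flow matching loss with a linear interpolation path.

    x1:   clean audio latents, shape (B, D)  -- illustrative shapes
    cond: pooled video/text conditioning features, shape (B, C)
    """
    x0 = torch.randn_like(x1)            # noise sample
    t = torch.rand(x1.shape[0], 1)       # uniform time in [0, 1]
    x_t = (1 - t) * x0 + t * x1          # point on the straight path
    target_v = x1 - x0                   # constant target velocity along the path
    pred_v = model(x_t, t, cond)
    return ((pred_v - target_v) ** 2).mean()


if __name__ == "__main__":
    model = FlowNet(latent_dim=64, cond_dim=32)
    x1 = torch.randn(8, 64)
    cond = torch.randn(8, 32)
    loss = flow_matching_loss(model, x1, cond)
    loss.backward()
    print(loss.item())
```

At inference, a model trained this way generates samples by integrating the learned velocity field from noise toward data in a small number of steps, which is consistent with the low inference time (1.23 s for an 8 s clip) reported in the abstract.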
