Taming Multimodal Joint Training for High-Quality Video-to-Audio Synthesis

December 19, 2024
Authors: Ho Kei Cheng, Masato Ishii, Akio Hayakawa, Takashi Shibuya, Alexander Schwing, Yuki Mitsufuji
cs.AI

Abstract

We propose to synthesize high-quality and synchronized audio, given video and optional text conditions, using a novel multimodal joint training framework MMAudio. In contrast to single-modality training conditioned on (limited) video data only, MMAudio is jointly trained with larger-scale, readily available text-audio data to learn to generate semantically aligned high-quality audio samples. Additionally, we improve audio-visual synchrony with a conditional synchronization module that aligns video conditions with audio latents at the frame level. Trained with a flow matching objective, MMAudio achieves new video-to-audio state-of-the-art among public models in terms of audio quality, semantic alignment, and audio-visual synchronization, while having a low inference time (1.23s to generate an 8s clip) and just 157M parameters. MMAudio also achieves surprisingly competitive performance in text-to-audio generation, showing that joint training does not hinder single-modality performance. Code and demo are available at: https://hkchengrex.github.io/MMAudio
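Since the abstract only sketches the method, the following are minimal, assumption-laden illustrations rather than the authors' implementation. A flow matching objective, in its common conditional (rectified-flow) form, trains a velocity field v_theta to transport noise to data; here x_1 is the audio latent, x_0 is Gaussian noise, and c bundles the video and optional text conditions (the paper's exact parameterization may differ):

```latex
% Standard conditional flow matching loss (assumed form, not taken from the paper).
\mathcal{L}_{\mathrm{FM}}(\theta)
  = \mathbb{E}_{t \sim \mathcal{U}[0,1],\; x_0 \sim \mathcal{N}(0, I),\; x_1}
    \left[ \left\lVert v_\theta(x_t, t, c) - (x_1 - x_0) \right\rVert^2 \right],
  \qquad x_t = (1 - t)\, x_0 + t\, x_1 .
```

Similarly, aligning "video conditions with audio latents at the frame level" plausibly involves resampling video features to the temporal resolution of the audio latents before conditioning on them. A minimal PyTorch sketch of that resampling step, with all names hypothetical:

```python
import torch
import torch.nn.functional as F

def align_video_to_audio_frames(video_feats: torch.Tensor,
                                num_audio_frames: int) -> torch.Tensor:
    """Resample per-frame video features to the audio-latent frame rate.

    Hypothetical illustration of frame-level alignment, not the paper's code.
    video_feats: (batch, T_video, channels), extracted at the video frame rate.
    Returns:     (batch, num_audio_frames, channels), so each audio latent frame
                 has a temporally corresponding video feature to condition on.
    """
    x = video_feats.transpose(1, 2)                        # (B, C, T_video)
    x = F.interpolate(x, size=num_audio_frames,
                      mode="linear", align_corners=False)  # (B, C, T_audio)
    return x.transpose(1, 2)                               # (B, T_audio, C)


# Example: an 8 s clip at 25 fps conditioning audio latents at ~43 latent frames/s.
video_feats = torch.randn(1, 200, 512)
aligned = align_video_to_audio_frames(video_feats, num_audio_frames=344)
assert aligned.shape == (1, 344, 512)
```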
