Video-Guided Foley Sound Generation with Multimodal Controls
November 26, 2024
Authors: Ziyang Chen, Prem Seetharaman, Bryan Russell, Oriol Nieto, David Bourgin, Andrew Owens, Justin Salamon
cs.AI
Abstract
Generating sound effects for videos often requires creating artistic sound
effects that diverge significantly from real-life sources, as well as flexible
control over the sound design. To address this problem, we introduce MultiFoley, a model
designed for video-guided sound generation that supports multimodal
conditioning through text, audio, and video. Given a silent video and a text
prompt, MultiFoley allows users to create clean sounds (e.g., skateboard wheels
spinning without wind noise) or more whimsical sounds (e.g., making a lion's
roar sound like a cat's meow). MultiFoley also allows users to choose reference
audio from sound effects (SFX) libraries or partial videos for conditioning. A
key novelty of our model lies in its joint training on both internet video
datasets with low-quality audio and professional SFX recordings, enabling
high-quality, full-bandwidth (48kHz) audio generation. Through automated
evaluations and human studies, we demonstrate that MultiFoley successfully
generates synchronized high-quality sounds across varied conditional inputs and
outperforms existing methods. Please see our project page for video results:
https://ificl.github.io/MultiFoley/
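The abstract highlights full-bandwidth 48 kHz generation as a key advantage over training only on internet video audio. A minimal sketch (not from the paper) of why the sample rate matters: the Nyquist frequency, half the sample rate, caps the highest frequency a signal can represent, so lower-rate audio common in prior video-to-audio work cannot contain the upper audible band at all.

```python
# Illustrative sketch, not the paper's code: the Nyquist frequency
# (sample_rate / 2) is the highest frequency representable at a given
# sample rate. Audio generated at 16 kHz cannot contain content above
# 8 kHz, while 48 kHz covers the full audible range (roughly 20 Hz-20 kHz).

def nyquist_hz(sample_rate_hz: int) -> float:
    """Highest representable frequency for a given sample rate, in Hz."""
    return sample_rate_hz / 2

print(nyquist_hz(16_000))  # 8000.0  -- misses most high-frequency detail
print(nyquist_hz(48_000))  # 24000.0 -- full audible band with headroom
```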