Video-Guided Foley Sound Generation with Multimodal Controls
November 26, 2024
作者: Ziyang Chen, Prem Seetharaman, Bryan Russell, Oriol Nieto, David Bourgin, Andrew Owens, Justin Salamon
cs.AI
Abstract
Generating sound effects for videos often requires creating artistic sound
effects that diverge significantly from real-life sources and flexible control
in the sound design. To address this problem, we introduce MultiFoley, a model
designed for video-guided sound generation that supports multimodal
conditioning through text, audio, and video. Given a silent video and a text
prompt, MultiFoley allows users to create clean sounds (e.g., skateboard wheels
spinning without wind noise) or more whimsical sounds (e.g., making a lion's
roar sound like a cat's meow). MultiFoley also allows users to choose reference
audio from sound effects (SFX) libraries or partial videos for conditioning. A
key novelty of our model lies in its joint training on both internet video
datasets with low-quality audio and professional SFX recordings, enabling
high-quality, full-bandwidth (48kHz) audio generation. Through automated
evaluations and human studies, we demonstrate that MultiFoley successfully
generates synchronized high-quality sounds across varied conditional inputs and
outperforms existing methods. Please see our project page for video results:
https://ificl.github.io/MultiFoley/