
Self-Supervised Audio-Visual Soundscape Stylization

September 22, 2024
Authors: Tingle Li, Renhao Wang, Po-Yao Huang, Andrew Owens, Gopala Anumanchipalli
cs.AI

Abstract

Speech sounds convey a great deal of information about the scenes, resulting in a variety of effects ranging from reverberation to additional ambient sounds. In this paper, we manipulate input speech to sound as though it was recorded within a different scene, given an audio-visual conditional example recorded from that scene. Our model learns through self-supervision, taking advantage of the fact that natural video contains recurring sound events and textures. We extract an audio clip from a video and apply speech enhancement. We then train a latent diffusion model to recover the original speech, using another audio-visual clip taken from elsewhere in the video as a conditional hint. Through this process, the model learns to transfer the conditional example's sound properties to the input speech. We show that our model can be successfully trained using unlabeled, in-the-wild videos, and that an additional visual signal can improve its sound prediction abilities. Please see our project webpage for video results: https://tinglok.netlify.app/files/avsoundscape/
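The self-supervised training recipe the abstract walks through — pull an audio clip from a video, enhance it to strip its scene acoustics, then learn to restore the original using a second clip from elsewhere in the same video as a conditional hint — can be sketched as a data-pairing step. The function below is a minimal illustration under stated assumptions: the speech-enhancement model and the audio-visual feature extraction are replaced by placeholders, and all names are illustrative, not from the authors' code.

```python
import random

def make_training_pair(video_clips, seed=None):
    """Build one self-supervised example from a single video's clips.

    Returns (enhanced, condition, target):
      - target:    an original clip, which the diffusion model must recover
      - enhanced:  the same clip after a stand-in 'speech enhancement' step
                   (a placeholder scaling here, not a real enhancer)
      - condition: a clip taken from elsewhere in the same video, serving as
                   the audio-visual hint carrying the scene's sound properties
    """
    rng = random.Random(seed)
    # Two distinct positions in the video, so the conditional hint comes
    # from elsewhere, as the method requires.
    i, j = rng.sample(range(len(video_clips)), 2)
    target = video_clips[i]
    enhanced = [s * 0.5 for s in target]  # placeholder for speech enhancement
    condition = video_clips[j]
    return enhanced, condition, target
```

A latent diffusion model would then be trained to map `enhanced` back to `target` given `condition`; because natural video contains recurring sound events and textures, the conditional clip is informative about the acoustics the enhancer removed.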
