
SoloAudio: Target Sound Extraction with Language-oriented Audio Diffusion Transformer

September 12, 2024
作者: Helin Wang, Jiarui Hai, Yen-Ju Lu, Karan Thakkar, Mounya Elhilali, Najim Dehak
cs.AI

Abstract

In this paper, we introduce SoloAudio, a novel diffusion-based generative model for target sound extraction (TSE). Our approach trains latent diffusion models on audio, replacing the previous U-Net backbone with a skip-connected Transformer that operates on latent features. SoloAudio supports both audio-oriented and language-oriented TSE by utilizing a CLAP model as the feature extractor for target sounds. Furthermore, SoloAudio leverages synthetic audio generated by state-of-the-art text-to-audio models for training, demonstrating strong generalization to out-of-domain data and unseen sound events. We evaluate this approach on the FSD Kaggle 2018 mixture dataset and real data from AudioSet, where SoloAudio achieves state-of-the-art results on both in-domain and out-of-domain data, and exhibits impressive zero-shot and few-shot capabilities. Source code and demos are released.
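The backbone described above (a Transformer over latent features with U-Net-style long skip connections, conditioned on a CLAP embedding of the target sound) can be sketched roughly as follows. This is an illustrative toy, not the authors' implementation: the shapes, the additive conditioning, the single-head attention, and all names are assumptions made for clarity.

```python
import numpy as np

rng = np.random.default_rng(0)
T, D = 16, 32                      # latent frames and latent dim (assumed sizes)

def make_params():
    """Random projection weights for one toy attention block (illustrative only)."""
    return {w: rng.normal(0, 0.02, (D, D)) for w in ("Wq", "Wk", "Wv")}

def block(x, p):
    """One simplified Transformer block: single-head self-attention + residual."""
    q, k, v = x @ p["Wq"], x @ p["Wk"], x @ p["Wv"]
    a = q @ k.T / np.sqrt(D)                   # scaled dot-product scores
    a = np.exp(a - a.max(-1, keepdims=True))   # stable softmax over frames
    a /= a.sum(-1, keepdims=True)
    return x + a @ v                           # residual connection

def skip_transformer(z, cond, depth=4):
    """U-Net-style long skips: block i in the first half feeds block depth-1-i."""
    x = z + cond                   # inject CLAP-style condition (broadcast over frames)
    skips = []
    for _ in range(depth // 2):    # "encoder" half: store activations
        x = block(x, make_params())
        skips.append(x)
    for _ in range(depth // 2):    # "decoder" half: fuse the matching skip
        x = block(x + skips.pop(), make_params())
    return x

z = rng.normal(size=(T, D))        # noisy latent at some diffusion step
cond = rng.normal(size=(1, D))     # stand-in for a CLAP embedding of the target sound
out = skip_transformer(z, cond)
print(out.shape)                   # (16, 32): same shape as the input latent
```

In the real model, `cond` would come from CLAP's audio or text encoder, which is what lets the same network serve both audio-oriented and language-oriented TSE.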

