SonicSim: A customizable simulation platform for speech processing in moving sound source scenarios
October 2, 2024
Authors: Kai Li, Wendi Sang, Chang Zeng, Runxuan Yang, Guo Chen, Xiaolin Hu
cs.AI
Abstract
The systematic evaluation of speech separation and enhancement models under
moving sound source conditions typically requires extensive data comprising
diverse scenarios. However, real-world datasets often contain insufficient data
to meet the training and evaluation requirements of models. Although synthetic
datasets offer a larger volume of data, their acoustic simulations lack
realism. Consequently, neither real-world nor synthetic datasets effectively
fulfill practical needs. To address these issues, we introduce SonicSim, a
synthetic toolkit designed to generate highly customizable data for moving
sound sources. SonicSim is built on the embodied AI simulation platform
Habitat-sim and supports multi-level adjustments at the scene, microphone,
and source levels, thereby generating more diverse synthetic data. Leveraging
SonicSim, we constructed a moving sound source benchmark dataset, SonicSet,
using LibriSpeech, the Freesound Dataset 50k (FSD50K), and the Free Music
Archive (FMA), together with 90 scenes from Matterport3D, to evaluate speech
separation and enhancement models.
Additionally, to validate the gap between synthetic and real-world data, we
randomly selected 5 hours of reverberation-free raw audio from the SonicSet
validation set and used it to record a real-world speech separation dataset,
which we then compared against the corresponding synthetic data. Similarly, we
utilized the real-world speech enhancement dataset RealMAN to assess the
acoustic gap between SonicSet and other synthetic datasets for speech
enhancement. The results indicate that the synthetic data generated by
SonicSim can effectively generalize to real-world scenarios. Demo and code are
publicly available at https://cslikai.cn/SonicSim/.
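To make the scene-, microphone-, and source-level configuration described above concrete, here is a minimal usage sketch of how such a toolkit might be driven. All names in it (the sonicsim module, MovingSourceSimulator, and its methods) are hypothetical illustrations, not the published SonicSim API; see https://cslikai.cn/SonicSim/ for the actual interface.

```python
# Hypothetical usage sketch: the module, class, and method names below are
# assumptions for illustration, not the actual SonicSim API.
from sonicsim import MovingSourceSimulator  # hypothetical import

# Scene-level: load a 3D environment, e.g. a Matterport3D scan.
sim = MovingSourceSimulator(scene="matterport3d/office_01")  # hypothetical scene id

# Microphone-level: place a receiver and select its array geometry.
sim.add_microphone(position=(1.0, 1.5, 2.0), array_type="mono")

# Source-level: bind a speech clip to a trajectory through the scene.
sim.add_moving_source(
    audio="librispeech/1089-134686-0000.flac",  # hypothetical clip path
    trajectory=[(0.0, 1.5, 0.5), (2.0, 1.5, 1.5), (4.0, 1.5, 3.0)],
    speed=0.5,  # meters per second along the trajectory
)

# Render the reverberant mixture captured at the microphone.
mixture = sim.render(duration=10.0, sample_rate=16000)
```

In a pipeline like the one used to build SonicSet, a loop of this kind would be repeated over many scenes, trajectories, and source clips to accumulate a large and diverse corpus.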