

Synthio: Augmenting Small-Scale Audio Classification Datasets with Synthetic Data

October 2, 2024
Authors: Sreyan Ghosh, Sonal Kumar, Zhifeng Kong, Rafael Valle, Bryan Catanzaro, Dinesh Manocha
cs.AI

Abstract

We present Synthio, a novel approach for augmenting small-scale audio classification datasets with synthetic data. Our goal is to improve audio classification accuracy with limited labeled data. Traditional data augmentation techniques, which apply artificial transformations (e.g., adding random noise or masking segments), struggle to create data that captures the true diversity present in real-world audio. To address this shortcoming, we propose to augment the dataset with synthetic audio generated from text-to-audio (T2A) diffusion models. However, synthesizing effective augmentations is challenging: the generated data should not only be acoustically consistent with the underlying small-scale dataset but also have sufficient compositional diversity. To overcome the first challenge, we align the generations of the T2A model with the small-scale dataset using preference optimization, ensuring that the acoustic characteristics of the generated data remain consistent with the small-scale dataset. To address the second challenge, we propose a novel caption generation technique that leverages the reasoning capabilities of Large Language Models to (1) generate diverse and meaningful audio captions and (2) iteratively refine their quality. The generated captions are then used to prompt the aligned T2A model. We extensively evaluate Synthio on ten datasets and four simulated limited-data settings. Results indicate that our method consistently outperforms all baselines by 0.1%-39% using a T2A model trained only on weakly-captioned AudioSet.
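The abstract outlines a two-stage recipe (preference-aligned T2A generation, plus LLM-driven caption generation and iterative refinement) without providing code. The Python sketch below is not the authors' implementation; it only illustrates the shape of such an augmentation loop. `generate_captions`, `refine_captions`, and `t2a_generate` are hypothetical callables standing in for the LLM captioner, the caption-refinement step, and the preference-aligned T2A diffusion model, respectively.

```python
# Minimal sketch of a Synthio-style augmentation loop (illustrative, not the paper's code).
from dataclasses import dataclass
from typing import Callable, List

@dataclass
class LabeledAudio:
    path: str   # path to a (synthetic) audio clip
    label: str  # classification label, e.g. "dog_bark"

def augment_dataset(
    labels: List[str],
    n_per_label: int,
    generate_captions: Callable[[str, int], List[str]],    # LLM: label -> diverse captions
    refine_captions: Callable[[List[str], str], List[str]], # LLM: iteratively improve captions
    t2a_generate: Callable[[str], str],                     # aligned T2A model: caption -> audio path
) -> List[LabeledAudio]:
    """Create synthetic labeled audio by prompting a preference-aligned T2A model
    with LLM-generated captions, one batch per class label."""
    synthetic: List[LabeledAudio] = []
    for label in labels:
        captions = generate_captions(label, n_per_label)  # step 1: diverse, meaningful captions
        captions = refine_captions(captions, label)       # step 2: iterative quality refinement
        for caption in captions:
            audio_path = t2a_generate(caption)             # step 3: prompt the aligned T2A model
            synthetic.append(LabeledAudio(audio_path, label))
    return synthetic
```

The refinement step, which in the paper iteratively improves caption quality, is deliberately left abstract here; the synthetic examples returned would then be mixed with the original small-scale dataset for classifier training.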
