Synthio: Augmenting Small-Scale Audio Classification Datasets with Synthetic Data

Abstract

We present Synthio, a novel approach for augmenting small-scale audio classification datasets with synthetic data. Our goal is to improve audio classification accuracy when labeled data is limited. Traditional data augmentation techniques, which apply artificial transformations (e.g., adding random noise or masking segments), struggle to create data that captures the true diversity of real-world audio. To address this shortcoming, we propose augmenting the dataset with synthetic audio generated by text-to-audio (T2A) diffusion models. However, synthesizing effective augmentations is challenging: the generated data must be acoustically consistent with the underlying small-scale dataset and must also have sufficient compositional diversity. To overcome the first challenge, we align the generations of the T2A model with the small-scale dataset using preference optimization. This ensures that the acoustic characteristics of the generated data remain consistent with the small-scale dataset. To address the second challenge, we propose a novel caption generation technique that leverages the reasoning capabilities of Large Language Models to (1) generate diverse and meaningful audio captions and (2) iteratively refine their quality. The generated captions are then used to prompt the aligned T2A model. We extensively evaluate Synthio on ten datasets and four simulated limited-data settings. Results indicate that our method consistently outperforms all baselines by 0.1%-39% using a T2A model trained only on weakly-captioned AudioSet.
