ReCLAP: Improving Zero Shot Audio Classification by Describing Sounds

Abstract

Open-vocabulary audio-language models, such as CLAP, offer a promising approach to zero-shot audio classification (ZSAC) by enabling classification with any arbitrary set of categories specified with natural language prompts. In this paper, we propose a simple but effective method to improve ZSAC with CLAP. Specifically, we shift from the conventional method of using prompts with abstract category labels (e.g., "Sound of an organ") to prompts that describe sounds using their inherent descriptive features in diverse contexts (e.g., "The organ's deep and resonant tones filled the cathedral."). To achieve this, we first propose ReCLAP, a CLAP model trained with rewritten audio captions for improved understanding of sounds in the wild. These rewritten captions describe each sound event in the original caption using its unique discriminative characteristics. ReCLAP outperforms all baselines on both multi-modal audio-text retrieval and ZSAC. Next, to improve zero-shot audio classification with ReCLAP, we propose prompt augmentation. In contrast to the traditional method of employing hand-written template prompts, we generate custom prompts for each unique label in the dataset. These custom prompts first describe the sound event in the label and then place it in diverse scenes. Our proposed method improves ReCLAP's performance on ZSAC by 1%-18% and outperforms all baselines by 1%-55%.
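As a rough illustration of classification with several descriptive prompts per label, the sketch below scores an audio embedding against each label's prompt embeddings and picks the label with the highest mean cosine similarity. Everything in it is an assumption made for illustration: the three-dimensional vectors are hand-made stand-ins for the outputs of CLAP-style audio and text encoders, the function names are hypothetical, and averaging similarities over a label's prompts is one plausible aggregation rule that the abstract itself does not confirm.

```python
import math


def normalize(vec):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def cosine(a, b):
    """Cosine similarity between two vectors."""
    return sum(x * y for x, y in zip(normalize(a), normalize(b)))


def classify(audio_emb, prompt_embs_by_label):
    """Return (best_label, scores).

    Each label's score is the mean cosine similarity between the audio
    embedding and that label's prompt embeddings (assumed aggregation).
    """
    scores = {}
    for label, prompt_embs in prompt_embs_by_label.items():
        sims = [cosine(audio_emb, p) for p in prompt_embs]
        scores[label] = sum(sims) / len(sims)
    best_label = max(scores, key=scores.get)
    return best_label, scores


# Toy embeddings (hypothetical): two descriptive prompts per label.
prompt_embs_by_label = {
    "organ": [[0.9, 0.1, 0.0], [0.8, 0.2, 0.1]],
    "dog bark": [[0.0, 0.9, 0.3], [0.1, 0.8, 0.4]],
}
audio_emb = [0.85, 0.15, 0.05]  # stand-in for an encoded audio clip

best_label, scores = classify(audio_emb, prompt_embs_by_label)
print(best_label, scores)
```

In a real pipeline, the toy vectors would be replaced by embeddings from the model's audio and text encoders, with one text embedding per generated prompt.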
