CLAIR-A: Leveraging Large Language Models to Judge Audio Captions

Abstract

The Automated Audio Captioning (AAC) task asks models to generate natural language descriptions of an audio input. Evaluating these machine-generated audio captions is a complex task that requires weighing several factors, including auditory scene understanding, sound-object inference, temporal coherence, and the environmental context of the scene. Current methods each focus on specific aspects, and they often fail to provide an overall score that aligns well with human judgment. In this work, we propose CLAIR-A, a simple and flexible method that leverages the zero-shot capabilities of large language models (LLMs) to evaluate candidate audio captions by directly asking an LLM for a semantic distance score. In our evaluations, CLAIR-A predicts human judgments of quality better than traditional metrics, with a 5.8% relative accuracy improvement over the domain-specific FENSE metric and up to 11% over the best general-purpose measure on the Clotho-Eval dataset. Moreover, CLAIR-A offers more transparency by allowing the language model to explain the reasoning behind its scores, and human evaluators rate these explanations up to 30% better than those provided by baseline methods. CLAIR-A is publicly available at https://github.com/DavidMChan/clair-a.
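As a rough illustration of the idea of asking an LLM directly for a score and an explanation, the sketch below builds a zero-shot prompt and parses a JSON reply. It is not the CLAIR-A implementation: the prompt wording, the 0 to 100 scale, the JSON keys, and the names build_prompt, parse_response, judge_caption, and fake_llm are all assumptions made for this example, and the model call is replaced by a stub so the code runs offline.

```python
import json
import re
from typing import Callable, List, Tuple


def build_prompt(candidate: str, references: List[str]) -> str:
    """Assemble a zero-shot judging prompt (hypothetical wording, not the paper's)."""
    refs = "\n".join(f"- {r}" for r in references)
    return (
        "You are judging a caption for an audio clip.\n"
        f"Candidate caption:\n- {candidate}\n"
        f"Reference captions:\n{refs}\n"
        "Rate from 0 to 100 how closely the candidate matches the sounds "
        "described by the references. Reply with JSON only: "
        '{"score": <integer 0-100>, "reason": "<short explanation>"}'
    )


def parse_response(text: str) -> Tuple[float, str]:
    """Parse the span from the first '{' to the last '}' as JSON.

    Returns (score clamped and rescaled to [0, 1], reason string).
    """
    match = re.search(r"\{.*\}", text, flags=re.DOTALL)
    if match is None:
        raise ValueError("no JSON object found in the model reply")
    payload = json.loads(match.group(0))
    score = min(max(float(payload["score"]), 0.0), 100.0) / 100.0
    return score, str(payload.get("reason", ""))


def judge_caption(
    candidate: str, references: List[str], llm: Callable[[str], str]
) -> Tuple[float, str]:
    """Score one candidate caption using any callable that maps prompt -> reply."""
    return parse_response(llm(build_prompt(candidate, references)))


def fake_llm(prompt: str) -> str:
    """Stand-in for a real model call (assumption: replies contain one JSON object)."""
    return 'Sure. {"score": 85, "reason": "Both mention a dog barking outdoors."}'


if __name__ == "__main__":
    score, reason = judge_caption(
        "A dog barks", ["A dog is barking in a yard"], fake_llm
    )
    print(score, reason)
```

Passing the model as a plain callable keeps the sketch independent of any particular LLM client, and returning the reason alongside the score mirrors the explanation aspect described in the abstract.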
