CLAIR-A: Leveraging Large Language Models to Judge Audio Captions
September 19, 2024
Authors: Tsung-Han Wu, Joseph E. Gonzalez, Trevor Darrell, David M. Chan
cs.AI
Abstract
The Automated Audio Captioning (AAC) task asks models to generate natural
language descriptions of an audio input. Evaluating these machine-generated
audio captions is a complex task that requires considering diverse factors,
among them, auditory scene understanding, sound-object inference, temporal
coherence, and the environmental context of the scene. While current methods
focus on specific aspects, they often fail to provide an overall score that
aligns well with human judgment. In this work, we propose CLAIR-A, a simple and
flexible method that leverages the zero-shot capabilities of large language
models (LLMs) to evaluate candidate audio captions by directly asking LLMs for
a semantic distance score. In our evaluations, CLAIR-A better predicts human
judgments of quality than traditional metrics, with a 5.8% relative
accuracy improvement compared to the domain-specific FENSE metric and up to 11%
over the best general-purpose measure on the Clotho-Eval dataset. Moreover,
CLAIR-A offers more transparency by allowing the language model to explain the
reasoning behind its scores, with these explanations rated up to 30% better by
human evaluators than those provided by baseline methods. CLAIR-A is made
publicly available at https://github.com/DavidMChan/clair-a.
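
In essence, the method reduces to a single zero-shot prompt: show the LLM the candidate caption alongside the human reference captions and ask it to return a numeric semantic match score together with a short justification. The sketch below illustrates that idea against an OpenAI-style chat API; the prompt wording, model name, score range, and JSON schema are illustrative assumptions rather than the exact ones used in the paper (see the official implementation at https://github.com/DavidMChan/clair-a).

```python
# Minimal sketch of the CLAIR-A idea: ask an LLM directly for a semantic
# match score plus a rationale. Prompt wording and schema are hypothetical.
import json
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def clair_a_score(candidate: str, references: list[str], model: str = "gpt-4o") -> dict:
    """Return {"score": int, "reason": str} for a candidate audio caption."""
    prompt = (
        "You are evaluating an automated audio captioning system.\n"
        f"Candidate caption: {candidate!r}\n"
        f"Reference captions: {references!r}\n"
        "On a scale of 0 to 100, how semantically close is the candidate to the "
        "references, considering the sounds and objects described, their temporal "
        "order, and the environmental context? Respond with JSON: "
        '{"score": <int>, "reason": "<one-sentence explanation>"}'
    )
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        response_format={"type": "json_object"},  # force parseable JSON output
    )
    return json.loads(response.choices[0].message.content)

# Example usage:
# result = clair_a_score(
#     "A dog barks while rain falls on a metal roof",
#     ["Rain patters on tin as a dog barks in the distance"],
# )
# print(result["score"], result["reason"])
```

Returning the explanation alongside the score is what gives the metric its transparency: the same LLM call produces both the number used for ranking and the rationale that human evaluators can inspect.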