Mind the Gap! Static and Interactive Evaluations of Large Audio Models
February 21, 2025
Authors: Minzhi Li, William Barr Held, Michael J Ryan, Kunat Pipatanakul, Potsawee Manakul, Hao Zhu, Diyi Yang
cs.AI
Abstract
As AI chatbots become ubiquitous, voice interaction presents a compelling way
to enable rapid, high-bandwidth communication for both semantic and social
signals. This has driven research into Large Audio Models (LAMs) to power
voice-native experiences. However, aligning LAM development with user goals
requires a clear understanding of user needs and preferences to establish
reliable progress metrics. This study addresses these challenges by introducing
an interactive approach to evaluate LAMs and collecting 7,500 LAM interactions
from 484 participants. Through topic modeling of user queries, we identify
primary use cases for audio interfaces. We then analyze user preference
rankings and qualitative feedback to determine which models best align with
user needs. Finally, we evaluate how well static benchmarks predict interactive
performance. Our analysis reveals that no individual benchmark strongly correlates
with interactive results (τ ≤ 0.33 for all benchmarks). While combining
multiple coarse-grained features yields modest predictive power (R² = 0.30),
only two out of twenty datasets on spoken question answering and age prediction
show significantly positive correlations. This suggests a clear need to develop
LAM evaluations that better correlate with user preferences.
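The τ reported in the abstract is a rank correlation between static benchmark scores and interactive results. As a rough illustration of how such a statistic can be computed, here is a minimal sketch of Kendall's tau in plain Python. It assumes the tau-a variant (ties count as neither concordant nor discordant); the abstract does not say which variant the authors used. The function name `kendall_tau` and both score lists are hypothetical and are not taken from the paper.

```python
from itertools import combinations


def kendall_tau(xs, ys):
    """Kendall's tau-a: (concordant - discordant) / number of pairs.

    Assumption: tau-a is used here for simplicity; tied pairs add
    nothing to either count.
    """
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length sequences with at least 2 items")
    concordant = discordant = 0
    for i, j in combinations(range(len(xs)), 2):
        # Positive product: the pair is ordered the same way in both lists.
        sign = (xs[i] - xs[j]) * (ys[i] - ys[j])
        if sign > 0:
            concordant += 1
        elif sign < 0:
            discordant += 1
    n_pairs = len(xs) * (len(xs) - 1) // 2
    return (concordant - discordant) / n_pairs


# Hypothetical per-model numbers for illustration only, not paper results.
benchmark_scores = [0.62, 0.48, 0.71, 0.55, 0.40]
interactive_win_rates = [0.58, 0.51, 0.49, 0.60, 0.35]
tau = kendall_tau(benchmark_scores, interactive_win_rates)
```

A value near 1 would mean the benchmark orders models almost exactly as users do, while a value near 0 means the benchmark ranking says little about interactive preference.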