AAD-LLM: Neural Attention-Driven Auditory Scene Understanding
February 24, 2025
Authors: Xilin Jiang, Sukru Samet Dindar, Vishal Choudhari, Stephan Bickel, Ashesh Mehta, Guy M McKhann, Adeen Flinker, Daniel Friedman, Nima Mesgarani
cs.AI
Abstract
Auditory foundation models, including auditory large language models (LLMs),
process all sound inputs equally, independent of listener perception. However,
human auditory perception is inherently selective: listeners focus on specific
speakers while ignoring others in complex auditory scenes. Existing models do
not incorporate this selectivity, limiting their ability to generate
perception-aligned responses. To address this, we introduce Intention-Informed
Auditory Scene Understanding (II-ASU) and present Auditory Attention-Driven LLM
(AAD-LLM), a prototype system that integrates brain signals to infer listener
attention. AAD-LLM extends an auditory LLM by incorporating intracranial
electroencephalography (iEEG) recordings to decode which speaker a listener is
attending to and refine responses accordingly. The model first predicts the
attended speaker from neural activity, then conditions response generation on
this inferred attentional state. We evaluate AAD-LLM on speaker description,
speech transcription and extraction, and question answering in multitalker
scenarios, with both objective and subjective ratings showing improved
alignment with listener intention. By taking a first step toward
intention-aware auditory AI, this work explores a new paradigm where listener
perception informs machine listening, paving the way for future
listener-centered auditory systems. Demo and code available:
https://aad-llm.github.io.
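The two-stage design described above (first decode the attended speaker from neural activity, then condition response generation on that inferred attention state) can be sketched as follows. This is a hypothetical simplification, not the authors' implementation: the attention decoder here uses plain envelope correlation (a classic auditory-attention-decoding baseline) in place of a learned iEEG decoder, and the "conditioning" step is illustrated as prompt construction; the function names and toy data are invented for illustration.

```python
# Hypothetical sketch of intention-informed auditory scene understanding:
# Stage 1 decodes the attended speaker; Stage 2 conditions the response on it.
from statistics import mean

def pearson(x, y):
    """Pearson correlation between two equal-length sequences."""
    mx, my = mean(x), mean(y)
    num = sum((a - mx) * (b - my) for a, b in zip(x, y))
    den = (sum((a - mx) ** 2 for a in x) * sum((b - my) ** 2 for b in y)) ** 0.5
    return num / den if den else 0.0

def decode_attended_speaker(neural_envelope, speaker_envelopes):
    """Stage 1: pick the speaker whose speech envelope best matches an
    envelope reconstructed from neural activity (stand-in for iEEG decoding)."""
    scores = {spk: pearson(neural_envelope, env)
              for spk, env in speaker_envelopes.items()}
    return max(scores, key=scores.get), scores

def build_conditioned_prompt(question, transcripts, attended):
    """Stage 2: condition response generation on the inferred attentional
    state by foregrounding the attended speaker's speech."""
    lines = [f"[attended] {attended}: {transcripts[attended]}"]
    lines += [f"[ignored] {spk}: {txt}"
              for spk, txt in transcripts.items() if spk != attended]
    return "\n".join(lines) + f"\nQuestion: {question}"

# Toy multitalker scene: the neural envelope tracks speaker B more closely.
envelopes = {"A": [0.1, 0.9, 0.2, 0.8, 0.1],
             "B": [0.7, 0.2, 0.8, 0.1, 0.9]}
neural = [0.6, 0.3, 0.7, 0.2, 0.8]  # made-up reconstructed envelope
attended, scores = decode_attended_speaker(neural, envelopes)
prompt = build_conditioned_prompt(
    "What was said?",
    {"A": "let's meet at noon", "B": "the train is delayed"},
    attended)
print(attended)  # speaker whose envelope correlates best with neural activity
```

In the paper's actual system, Stage 1 is performed by decoding iEEG recordings rather than by direct envelope correlation, and Stage 2 conditions an auditory LLM's generation internally rather than through a text prompt; the sketch only mirrors the stated control flow.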