Audio Flamingo 2: An Audio-Language Model with Long-Audio Understanding and Expert Reasoning Abilities

Abstract

Understanding and reasoning over non-speech sounds and music are crucial for both humans and AI agents to interact effectively with their environments. In this paper, we introduce Audio Flamingo 2 (AF2), an Audio-Language Model (ALM) with advanced audio understanding and reasoning capabilities. AF2 leverages (i) a custom CLAP model, (ii) synthetic Audio QA data for fine-grained audio reasoning, and (iii) a multi-stage curriculum learning strategy. AF2 achieves state-of-the-art performance with only a 3B-parameter small language model, surpassing large open-source and proprietary models on more than 20 benchmarks. We also extend audio understanding, for the first time, to long audio segments (30 seconds to 5 minutes) and propose LongAudio, a large and novel dataset for training ALMs on long-audio captioning and question-answering tasks. Fine-tuning AF2 on LongAudio leads to exceptional performance on our proposed LongAudioBench, an expert-annotated benchmark for evaluating the long-audio understanding capabilities of ALMs. We conduct extensive ablation studies to confirm the efficacy of our approach. Project website: https://research.nvidia.com/labs/adlr/AF2/.
