
Audio-FLAN: A Preliminary Release

February 23, 2025
作者: Liumeng Xue, Ziya Zhou, Jiahao Pan, Zixuan Li, Shuai Fan, Yinghao Ma, Sitong Cheng, Dongchao Yang, Haohan Guo, Yujia Xiao, Xinsheng Wang, Zixuan Shen, Chuanbo Zhu, Xinshen Zhang, Tianchi Liu, Ruibin Yuan, Zeyue Tian, Haohe Liu, Emmanouil Benetos, Ge Zhang, Yike Guo, Wei Xue
cs.AI

Abstract

Recent advancements in audio tokenization have significantly enhanced the integration of audio capabilities into large language models (LLMs). However, audio understanding and generation are often treated as distinct tasks, hindering the development of truly unified audio-language models. While instruction tuning has demonstrated remarkable success in improving generalization and zero-shot learning across text and vision, its application to audio remains largely unexplored. A major obstacle is the lack of comprehensive datasets that unify audio understanding and generation. To address this, we introduce Audio-FLAN, a large-scale instruction-tuning dataset covering 80 diverse tasks across speech, music, and sound domains, with over 100 million instances. Audio-FLAN lays the foundation for unified audio-language models that can seamlessly handle both understanding (e.g., transcription, comprehension) and generation (e.g., speech, music, sound) tasks across a wide range of audio domains in a zero-shot manner. The Audio-FLAN dataset is available on HuggingFace and GitHub and will be continuously updated.
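The abstract frames understanding (audio in, text out) and generation (text in, audio out) as a single instruction-tuning format. As a rough illustration of how one record could represent either mode, the sketch below defines a minimal instance type; all field names are hypothetical assumptions for illustration, not taken from the released Audio-FLAN schema.

```python
# Hypothetical sketch of a unified instruction-tuning record. Field names are
# assumptions, not the schema released with Audio-FLAN.
from dataclasses import dataclass
from typing import Optional

@dataclass
class AudioFLANInstance:
    task: str                         # one of the ~80 tasks, e.g. "speech_transcription"
    domain: str                       # "speech", "music", or "sound"
    mode: str                         # "understanding" or "generation"
    instruction: str                  # natural-language instruction given to the model
    audio_input: Optional[str] = None # path/URI to input audio, if any
    text_input: Optional[str] = None  # optional text conditioning
    target: str = ""                  # text answer (understanding) or a
                                      # reference to target audio (generation)

# An understanding-style instance (audio in, text out):
asr = AudioFLANInstance(
    task="speech_transcription",
    domain="speech",
    mode="understanding",
    instruction="Transcribe the following audio clip.",
    audio_input="clip_0001.wav",
    target="hello world",
)

# A generation-style instance (text in, audio out):
tts = AudioFLANInstance(
    task="text_to_speech",
    domain="speech",
    mode="generation",
    instruction="Synthesize speech for the given text.",
    text_input="hello world",
    target="target_0001.wav",
)
```

Keeping both task types in one record layout is what lets a single model be instruction-tuned on understanding and generation jointly, rather than treating them as separate pipelines.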

