SIFT-50M: A Large-Scale Multilingual Dataset for Speech Instruction Fine-Tuning
April 12, 2025
Authors: Prabhat Pandey, Rupak Vignesh Swaminathan, K V Vijay Girish, Arunasish Sen, Jian Xie, Grant P. Strimel, Andreas Schwarz
cs.AI
Abstract
We introduce SIFT (Speech Instruction Fine-Tuning), a 50M-example dataset
designed for instruction fine-tuning and pre-training of speech-text large
language models (LLMs). SIFT-50M is built from publicly available speech
corpora, which collectively contain 14K hours of speech, and leverages LLMs
along with off-the-shelf expert models. The dataset spans five languages,
encompassing a diverse range of speech understanding as well as controllable
speech generation instructions. Using SIFT-50M, we train SIFT-LLM, which
outperforms existing speech-text LLMs on instruction-following benchmarks while
achieving competitive performance on foundational speech tasks. To support
further research, we also introduce EvalSIFT, a benchmark dataset specifically
designed to evaluate the instruction-following capabilities of speech-text
LLMs.
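To make the notion of "speech instruction fine-tuning" concrete, the sketch below shows one plausible shape for a training record pairing an audio reference with an instruction and target response. The field names, task labels, and prompt template are illustrative assumptions for exposition only, not the actual SIFT-50M schema.

```python
# Hypothetical sketch of a speech instruction-tuning record.
# All field names and the prompt format are illustrative assumptions,
# not the published SIFT-50M data format.
example = {
    "audio_path": "clip_0001.wav",       # utterance from a source speech corpus
    "instruction": "Identify the speaker's emotion in this clip.",
    "response": "The speaker sounds excited.",
    "language": "en",                    # one of the five covered languages
    "task": "speech_understanding",      # vs. controllable speech generation
}

def to_prompt(rec: dict) -> str:
    """Render one record into a single training prompt string
    (audio placeholder, then instruction, then target response)."""
    return f"<audio:{rec['audio_path']}>\n{rec['instruction']}\n{rec['response']}"

print(to_prompt(example))
```

A speech-text LLM would consume such records with the audio placeholder replaced by actual speech features, training the model to follow the textual instruction conditioned on the audio.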