MOSEL: 950,000 Hours of Speech Data for Open-Source Speech Foundation Model Training on EU Languages
October 1, 2024
Authors: Marco Gaido, Sara Papi, Luisa Bentivogli, Alessio Brutti, Mauro Cettolo, Roberto Gretter, Marco Matassoni, Mohamed Nabih, Matteo Negri
cs.AI
Abstract
The rise of foundation models (FMs), coupled with regulatory efforts
addressing their risks and impacts, has sparked significant interest in
open-source models. However, existing speech FMs (SFMs) fall short of full
compliance with the open-source principles, even if claimed otherwise, as no
existing SFM has model weights, code, and training data publicly available
under open-source terms. In this work, we take the first step toward filling
this gap by focusing on the 24 official languages of the European Union (EU).
We collect suitable training data by surveying automatic speech recognition
datasets and unlabeled speech corpora under open-source compliant licenses, for
a total of 950k hours. Additionally, we release automatic transcripts for 441k
hours of unlabeled data under the permissive CC-BY license, thereby
facilitating the creation of open-source SFMs for the EU languages.