Emilia: A Large-Scale, Extensive, Multilingual, and Diverse Dataset for Speech Generation
January 27, 2025
Authors: Haorui He, Zengqiang Shang, Chaoren Wang, Xuyuan Li, Yicheng Gu, Hua Hua, Liwei Liu, Chen Yang, Jiaqi Li, Peiyang Shi, Yuancheng Wang, Kai Chen, Pengyuan Zhang, Zhizheng Wu
cs.AI
Abstract
Recent advancements in speech generation have been driven by large-scale
training datasets. However, current models fall short of capturing the
spontaneity and variability inherent in real-world human speech, due to their
reliance on audiobook datasets limited to formal read-aloud speech styles. To
bridge this gap, we introduce Emilia-Pipe, an open-source preprocessing
pipeline to extract high-quality training data from valuable yet underexplored
in-the-wild data that capture spontaneous human speech in real-world contexts.
By leveraging Emilia-Pipe, we construct Emilia, the first multilingual speech
generation dataset derived from in-the-wild speech data. This dataset comprises
over 101k hours of speech across six languages: English, Chinese, German,
French, Japanese, and Korean. In addition, we expand Emilia to Emilia-Large, a
dataset exceeding 216k hours, making it the largest open-source speech
generation dataset available. Extensive experiments demonstrate that Emilia
significantly outperforms traditional audiobook datasets in generating
spontaneous and human-like speech, showcasing superior performance in capturing
diverse speaker timbre and speaking styles of real-world human speech.
Furthermore, this work underscores the importance of scaling dataset size to
advance speech generation research and validates the effectiveness of Emilia
for both multilingual and crosslingual speech generation.