Emilia: A Large-Scale, Extensive, Multilingual, and Diverse Dataset for Speech Generation

Abstract

Recent advancements in speech generation have been driven by large-scale training datasets. However, current models fall short of capturing the spontaneity and variability inherent in real-world human speech because they rely on audiobook datasets, which are limited to formal read-aloud speech styles. To bridge this gap, we introduce Emilia-Pipe, an open-source preprocessing pipeline that extracts high-quality training data from valuable yet underexplored in-the-wild data capturing spontaneous human speech in real-world contexts. Using Emilia-Pipe, we construct Emilia, the first multilingual speech generation dataset derived from in-the-wild speech data. This dataset comprises over 101k hours of speech across six languages: English, Chinese, German, French, Japanese, and Korean. In addition, we expand Emilia to Emilia-Large, a dataset exceeding 216k hours, making it the largest open-source speech generation dataset available. Extensive experiments demonstrate that Emilia significantly outperforms traditional audiobook datasets in generating spontaneous and human-like speech, and that it better captures the diverse speaker timbres and speaking styles of real-world human speech. Furthermore, this work underscores the importance of scaling dataset size to advance speech generation research and validates the effectiveness of Emilia for both multilingual and crosslingual speech generation.
