Emilia: A Large-Scale, Extensive, Multilingual, and Diverse Dataset for Speech Generation

Abstract

Recent advances in speech generation have been driven by large-scale training datasets. However, current models fall short of capturing the spontaneity and variability inherent in real-world human speech because they rely on audiobook datasets limited to formal, read-aloud speech styles. To bridge this gap, we introduce Emilia-Pipe, an open-source preprocessing pipeline that extracts high-quality training data from valuable yet underexplored in-the-wild data capturing spontaneous human speech in real-world contexts. Using Emilia-Pipe, we construct Emilia, the first multilingual speech generation dataset derived from in-the-wild speech data. The dataset comprises over 101k hours of speech across six languages: English, Chinese, German, French, Japanese, and Korean. We also expand Emilia to Emilia-Large, a dataset exceeding 216k hours, making it the largest open-source speech generation dataset available. Extensive experiments demonstrate that Emilia significantly outperforms traditional audiobook datasets in generating spontaneous and human-like speech, and better captures the diverse speaker timbres and speaking styles of real-world human speech. Furthermore, this work underscores the importance of scaling dataset size to advance speech generation research and validates the effectiveness of Emilia for both multilingual and crosslingual speech generation.

