YuLan-Mini: An Open Data-efficient Language Model

Abstract

Effective pre-training of large language models (LLMs) has been challenging due to the immense resource demands and the complexity of the technical processes involved. This paper presents a detailed technical report on YuLan-Mini, a highly capable base model with 2.42B parameters that achieves top-tier performance among models of similar parameter scale. Our pre-training approach focuses on enhancing training efficacy through three key technical contributions: an elaborate data pipeline that combines data cleaning with data scheduling strategies, a robust optimization method that mitigates training instability, and an effective annealing approach that incorporates targeted data selection and long-context training. Remarkably, YuLan-Mini, trained on 1.08T tokens, achieves performance comparable to industry-leading models that require significantly more data. To facilitate reproduction, we release the full details of the data composition for each training phase. Project details can be accessed at the following link: https://github.com/RUC-GSAI/YuLan-Mini.

